ennox108 opened a new issue, #10863: URL: https://github.com/apache/hudi/issues/10863
I am trying to run a Flink job to get data from SQL Server to S3. I am doing offline compaction, but whenever it is triggered I end up with fewer records than before the compaction. Based on the commits, it looks like it is ignoring data in the old parquet files.

The compaction is triggered using:

    /bin/flink run-application -t yarn-application \
      -Dyarn.application.name=CompactionCas \
      -Dyarn.application.queue=casualty \
      -Djobmanager.memory.process.size=16384m \
      -Dtaskmanager.memory.process.size=16384m \
      -Dtaskmanager.memory.managed.fraction=0.05 \
      -Dtaskmanager.memory.task.off-heap.size=512m \
      -Dtaskmanager.memory.framework.off-heap.size=512m \
      -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
      /lib/hudi-flink1.17-bundle-0.13.1-amzn-0.jar \
      --path s3://<bucket>/data/casualty/raw/table-name \
      --compaction-max-memory 2048

Here are the configs I use:

    'connector' = 'hudi',"
    'write.tasks' = '" + loadTasks + "',"
    'path' = '" + sinkLocation + "',"
    'hoodie.fs.atomic_creation.support' = 's3',"
    'table.type' = 'MERGE_ON_READ',"
    'write.rate.limit' = '0',"
    'precombine.field' = 'lsn',"
    'metadata.enabled' = 'true',"
    'index.type' = 'BUCKET',"
    'hoodie.bucket.index.hash.field' = '" + indexField + "',"
    'hoodie.bucket.index.num.buckets' = '" + indexBucketNum + "',"
    'hoodie.database.name' = '" + dbName + "',"
    'hoodie.table.name' = '" + tableName + "',"
    'hoodie.datasource.write.hive_style_partitioning' = 'false',"
    'hive_sync.support_timestamp' = 'true',"
    'hive_sync.enabled' = 'true',"
    'hive_sync.mode' = 'hms',"
    'hive_sync.metastore.uris' = '" + hiveMetaURI + "',"
    'hive_sync.db' = '" + dbName + "',"
    'hive_sync.table' = '" + tableName + "',"
    'hoodie.embed.timeline.server' = 'false',"
    'compaction.schedule.enabled' = 'true',"
    'compaction.async.enabled' = 'false',"
    'compaction.trigger.strategy' = 'num_commits',"
    'compaction.delta_commits' = '1',"
    'clean.retain_commits' = '5',"
    'archive.max_commits' = '15',"
    'archive.min_commits' = '10')"

Versions:

    Spark - 3.4.0
    Flink - 1.17.0
    Hive - 3.1.3
    EMR - 6.12.0
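
For reference, here is a minimal sketch of the merge behaviour a reader would normally expect from compaction on a MERGE_ON_READ table. It is plain Python, not Hudi code. It assumes a hypothetical record key field named `id`, uses `lsn` as the precombine field as in the configs above, and ignores deletes. Under those assumptions, rows that exist only in the old base (parquet) file survive compaction, so the record count should never go down.

```python
def compact(base_records, log_records, key="id", precombine="lsn"):
    """Merge base-file rows with log-file rows.

    Illustrative only: keeps, per record key, the row with the highest
    precombine value. The `id` key name is an assumption for this sketch.
    """
    merged = {}
    # Base rows are visited first, so on a tie the log row wins.
    for rec in list(base_records) + list(log_records):
        current = merged.get(rec[key])
        if current is None or rec[precombine] >= current[precombine]:
            merged[rec[key]] = rec
    return list(merged.values())


# Rows already in the old parquet (base) file.
base = [{"id": 1, "lsn": 10, "v": "a"}, {"id": 2, "lsn": 11, "v": "b"}]
# Rows written to log files since the last compaction.
logs = [{"id": 2, "lsn": 15, "v": "b2"}, {"id": 3, "lsn": 16, "v": "c"}]

result = compact(base, logs)
```

In this example `id` 1 appears only in the base file and is still present after the merge. The behaviour reported above looks as if such base-only rows were dropped.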