ennox108 opened a new issue, #10863:
URL: https://github.com/apache/hudi/issues/10863

   I am trying to run a Flink job that copies data from SQL Server to S3.
   
   I am running offline compaction, but whenever it is triggered I end up with
fewer records than before the compaction. Based on the commits, it looks like
the compaction is ignoring data in the old parquet files.
   
   The compaction is triggered using:
   
   /bin/flink run-application -t yarn-application \
     -Dyarn.application.name=CompactionCas \
     -Dyarn.application.queue=casualty \
     -Djobmanager.memory.process.size=16384m \
     -Dtaskmanager.memory.process.size=16384m \
     -Dtaskmanager.memory.managed.fraction=0.05 \
     -Dtaskmanager.memory.task.off-heap.size=512m \
     -Dtaskmanager.memory.framework.off-heap.size=512m \
     -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
     /lib/hudi-flink1.17-bundle-0.13.1-amzn-0.jar \
     --path s3://<bucket>/data/casualty/raw/table-name \
     --compaction-max-memory 2048
   
   
   Here are the configs I use
   
   'connector' = 'hudi',"
   'write.tasks' = '" + loadTasks + "',"
   'path' = '" + sinkLocation + "',"
   'hoodie.fs.atomic_creation.support' = 's3',"
   'table.type' = 'MERGE_ON_READ',"
   'write.rate.limit' = '0',"
   'precombine.field' = 'lsn',"
   'metadata.enabled' = 'true',"
   'index.type' = 'BUCKET',"
   'hoodie.bucket.index.hash.field' = '" + indexField + "',"
   'hoodie.bucket.index.num.buckets' = '" + indexBucketNum + "',"
   'hoodie.database.name' = '" + dbName + "',"
   'hoodie.table.name' = '" + tableName + "',"
   'hoodie.datasource.write.hive_style_partitioning' = 'false',"
   'hive_sync.support_timestamp' = 'true',"
   'hive_sync.enabled' = 'true',"
   'hive_sync.mode' = 'hms',"
   'hive_sync.metastore.uris' = '" + hiveMetaURI + "',"
   'hive_sync.db' = '" + dbName + "',"
   'hive_sync.table' = '" + tableName + "',"
   'hoodie.embed.timeline.server' = 'false',"
   'compaction.schedule.enabled' = 'true',"
   'compaction.async.enabled' = 'false',"
   'compaction.trigger.strategy' = 'num_commits',"
   'compaction.delta_commits' = '1',"
   'clean.retain_commits' = '5',"
   'archive.max_commits' = '15',"
   'archive.min_commits' = '10')"
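
For readers who want to reproduce the setup, here is a minimal sketch of how an option list like the one above could be rendered into a Flink SQL `WITH (...)` clause without hand-balancing quotes and commas. The class `HudiSinkOptions` and its methods are hypothetical helper names, not part of Hudi or Flink, and the arguments are placeholders for `sinkLocation`, `indexField` and `indexBucketNum`. Only a subset of the options from the list is included, chosen because they look most relevant to compaction.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not a Hudi or Flink API): renders a Flink SQL WITH clause. */
final class HudiSinkOptions {

    private HudiSinkOptions() {}

    /** Wraps a value in single quotes, doubling embedded quotes as SQL string literals require. */
    static String quote(String s) {
        return "'" + s.replace("'", "''") + "'";
    }

    /** Renders the options in insertion order, one "'key' = 'value'" pair per line. */
    static String renderWith(Map<String, String> options) {
        StringJoiner joiner = new StringJoiner(",\n", "WITH (\n", "\n)");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joiner.add("  " + quote(e.getKey()) + " = " + quote(e.getValue()));
        }
        return joiner.toString();
    }

    /** Subset of the options listed in the issue; the arguments are placeholders. */
    static Map<String, String> compactionRelatedOptions(
            String sinkLocation, String indexField, int indexBucketNum) {
        Map<String, String> o = new LinkedHashMap<>();
        o.put("connector", "hudi");
        o.put("path", sinkLocation);
        o.put("table.type", "MERGE_ON_READ");
        o.put("precombine.field", "lsn");
        o.put("index.type", "BUCKET");
        o.put("hoodie.bucket.index.hash.field", indexField);
        o.put("hoodie.bucket.index.num.buckets", String.valueOf(indexBucketNum));
        // The writer only schedules compaction plans; execution is left to the offline job.
        o.put("compaction.schedule.enabled", "true");
        o.put("compaction.async.enabled", "false");
        o.put("compaction.trigger.strategy", "num_commits");
        o.put("compaction.delta_commits", "1");
        return o;
    }
}
```

Calling `HudiSinkOptions.renderWith(HudiSinkOptions.compactionRelatedOptions(...))` yields the clause that would follow the column list in a `CREATE TABLE` statement. Building it from a map keeps each key visible on its own line, which makes it easier to compare the writer's settings with what the offline compactor sees.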
   
   
   
   Spark - 3.4.0
   Flink - 1.17.0
   Hive - 3.1.3
   EMR - 6.12.0

