[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479679#comment-17479679
 ] 

sivabalan narayanan commented on HUDI-3242:
-------------------------------------------

Thanks.

Here is what is happening. 

We set the user-configured checkpoint in the checkpoint reset key, so that in 
subsequent sync() calls in continuous mode we ignore the user-configured 
checkpoint and continue with the last checkpoint (checkpoint key) from the 
commit metadata. Essentially, if your continuous-mode run makes 10 commits, the 
checkpoint for the 1st sync() is picked up from what the user configured, but 
from the 2nd commit onwards it is picked up from the last commit's metadata. 

But if you want to do a backfill w/ exactly the same checkpoint value that you 
had set when you started the continuous-mode deltastreamer, you run into this 
issue: the configured checkpoint matches the reset key already recorded in the 
commit metadata, so it gets ignored. 
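
To double-check what deltastreamer will compare against, you can peek at the 
latest .commit file under .hoodie/. A rough sketch (the table path is the one 
from this issue, and the metadata key names deltastreamer.checkpoint.key / 
deltastreamer.checkpoint.reset_key are my recollection of what deltastreamer 
writes, so treat them as an assumption):

{code:bash}
# Sketch only: list the checkpoint entries stored in the latest commit metadata.
# Key names assumed from memory; if they differ, just grep for "checkpoint".
TABLE=s3://datalake-hudi/datastream/v1/sessions_by_date
LATEST=$(aws s3 ls "${TABLE}/.hoodie/" | awk '{print $4}' | grep '\.commit$' | sort | tail -1)
aws s3 cp "${TABLE}/.hoodie/${LATEST}" - | grep 'deltastreamer.checkpoint'
{code}

If the reset key shown there already equals the --checkpoint value you pass on 
the restart, that configured checkpoint is the one that gets skipped.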

 

Can you start deltastreamer w/o setting any checkpoint? Once a couple of 
commits go through, you can stop deltastreamer and then restart it w/ whatever 
new checkpoint your backfill requires (e.g. 0 in your case), as sketched below. 
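
Roughly, something like this (a sketch only; I am abbreviating and reusing the 
spark-submit options from the description below, and the only real difference 
between the two runs is the --checkpoint flag):

{code:bash}
# Run 1: start w/o any checkpoint; let a couple of commits complete, then stop the job.
# (All remaining spark-submit / --hoodie-conf options exactly as in the description.)
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \
  --table-type COPY_ON_WRITE \
  --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
  --target-table sessions_by_date \
  --op INSERT

# Run 2 (backfill): restart w/ the checkpoint you actually want, e.g. 0,
# again with the same remaining options as in the description.
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \
  --table-type COPY_ON_WRITE \
  --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
  --target-table sessions_by_date \
  --op INSERT \
  --checkpoint 0
{code}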

 

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-3242
>                 URL: https://issues.apache.org/jira/browse/HUDI-3242
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark, writer-core
>    Affects Versions: 0.10.1
>         Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>            Reporter: Harsha Teja Kanna
>            Assignee: sivabalan narayanan
>            Priority: Critical
>              Labels: hudi-on-call, sev:critical, user-support-issues
>         Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, I see that for a certain table only partial discovery of files 
> happens after the initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files and all of them are discovered.
> Second partition: 2021/02 has 762 files but only 72 are discovered.
> Checkpoint is set to 0. 
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000000 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=250000000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \
> --hoodie-conf hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT id, qid, aid, to_timestamp(timestamp) as timestamp, sid, date_format(to_timestamp(timestamp), 'yyyy/MM/dd') AS date FROM <SRC> a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> --hoodie-conf hoodie.generate.consistent.timestamp.logical.for.key.generator=true \
> --hoodie-conf hoodie.insert.shuffle.parallelism=1000 \
> --hoodie-conf hoodie.metadata.enable=true \
> --hoodie-conf hoodie.metadata.metrics.enable=true \
> --hoodie-conf hoodie.metrics.cloudwatch.metric.prefix=emr.datalake-service.prd.insert.sessions_by_date \
> --hoodie-conf hoodie.metrics.on=true \
> --hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \
> --hoodie-conf hoodie.parquet.block.size=268435456 \
> --hoodie-conf hoodie.parquet.compression.codec=snappy \
> --hoodie-conf hoodie.parquet.max.file.size=268435456 \
> --hoodie-conf hoodie.parquet.small.file.limit=250000000 {code}
>  
>  
>  


