[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-05-17 Thread Charlie Briggs (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538175#comment-17538175
 ] 

Charlie Briggs commented on HUDI-3242:
--

We've also been confused by this behaviour (on Hudi 0.10.1, but it applies to 0.11.0 as well).

* The checkpointing mechanism, in particular "deltastreamer.checkpoint.reset_key", is not 
well described in the documentation.
* We assumed that to load a given input using Deltastreamer, we could run with 
`--checkpoint 0` (this seemed intuitive, though perhaps only to us).
* When `--checkpoint 0` is passed and the previous commit already recorded 
`deltastreamer.checkpoint.reset_key = 0`, the `--checkpoint` flag is silently ignored; 
in Hudi 0.10.1 at least, nothing is logged to show this code path was taken.

I would expect that, regardless of past commits or epoch times, an explicitly 
specified checkpoint should always be honoured by Deltastreamer.

For reference, our use case is this:

* We have two Deltastreamer jobs: one loads at an interval from S3, the other 
loads daily from a separate S3 bucket (both parquet inputs).
* We use Deltastreamer for both jobs because schema validation fails when it is 
enabled while a Deltastreamer and a Spark datasource writer run concurrently 
(https://issues.apache.org/jira/browse/HUDI-2755). We do not currently wish to 
run without schema validation, as we have broken tables in the past with it 
disabled.
* We can't disable checkpointing on either of the two jobs, nor copy checkpoint 
information from the previous commit (we previously relied on the support 
described in https://issues.apache.org/jira/browse/HUDI-2719 for this).

As such, we instead write first to an ephemeral location, and then load from 
there using `--checkpoint 0` to force all data in the input to be loaded, 
regardless of write times.
_Fortunately the checkpointing behaviour described above is not currently a 
blocker for us: we only allow one writer at a time, so the ephemeral data is 
guaranteed to be newer than the previous checkpoint. It is confusing nonetheless._

> Checkpoint 0 is ignored -Partial parquet file discovery after the first commit
> --
>
> Key: HUDI-3242
> URL: https://issues.apache.org/jira/browse/HUDI-3242
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, writer-core
>Affects Versions: 0.10.1
> Environment: AWS
> EMR 6.4.0
> Spark 3.1.2
> Hudi - 0.10.1-rc
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Minor
>  Labels: hudi-on-call, sev:critical, user-support-issues
> Attachments: Screen Shot 2022-01-13 at 2.40.55 AM.png, Screen Shot 
> 2022-01-13 at 2.55.35 AM.png, Screen Shot 2022-01-20 at 1.36.48 PM.png
>
>   Original Estimate: 3h
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Hi, I am testing release branch 0.10.1 as I needed a few bug fixes from it.
> However, for a certain table I see only partial discovery of files after the 
> initial commit of the table.
> But if the second partition is given as input for the first commit, all the 
> files are discovered.
> First partition: 2021/01 has 744 files, and all of them are discovered.
> Second partition: 2021/02 has 762 files, but only 72 are discovered.
> Checkpoint is set to 0.
> No errors in the logs.
> {code:java}
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 30 \
> --driver-memory 32g \
> --executor-cores 5 \
> --executor-memory 32g \
> --num-executors 120 \
> --jars s3://bucket/apps/datalake/jars/unused-1.0.0.jar,s3://bucket/apps/datalake/jars/spark-avro_2.12-3.1.2.jar \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer s3://bucket/apps/datalake/jars/hudi-0.10.0/hudi-utilities-bundle_2.12-0.10.0.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/datastream/v1/sessions_by_date \
> --target-table sessions_by_date \
> --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --checkpoint 0 \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=1 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=false \
> --hoodie-conf hoodie.clustering.inline.max.commits=1 \
> --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkSizeBasedClusteringPlanStrategy \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=100 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=25000 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=sid,id \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=268435456 \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions_by_date \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=id,qid,aid \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/history/datastream/v1/sessions/2021/02 \
> --hoodie-conf 
> 

[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-23 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480717#comment-17480717
 ] 

sivabalan narayanan commented on HUDI-3242:
---

Based on offline discussions, I heard the user got unblocked for now. Reducing 
the priority to non-critical for now.


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479835#comment-17479835
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

Input: monthly partitions

partition=2021/01

file1_1 - timestamp1
file2_1 - timestamp2
file3_1 - timestamp3

partition=2021/02

file1_2 - timestamp1
file2_2 - timestamp2
file3_2 - timestamp3

Now I want to run Deltastreamer partition after partition to create the Hudi table.


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479833#comment-17479833
 ] 

sivabalan narayanan commented on HUDI-3242:
---

Sorry, I don't understand this statement of yours: "Also few of my unloaded 
datasets have non linear timestamps across partitions and I create the hudi 
table partition after partition and set checkpoint to 0." Can you please 
clarify?

I am trying to understand what your intention is in explicitly setting the 
checkpoint value to 0.


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479696#comment-17479696
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

I have a dataset that is partitioned but has very small files, in JSON format, 
and I need to create multiple Hudi tables out of it.
So what I do is process it with a Spark program that converts it into Parquet 
with bigger files.
As the dataset is large, the partitions are processed in parallel, and the file 
timestamps can therefore be in any order. The operation is only 'INSERT'.
I used to create the table by setting checkpoint 0, even in the 0.10.0 release.

How can I do that now?


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479685#comment-17479685
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

Let me try that.

Yes, I am aware that the checkpoint need not be set for initial table creation; 
it was just my automation passing it by mistake that uncovered the issue.

Also, a few of my unloaded datasets have non-linear timestamps across 
partitions, so I create the Hudi table partition after partition and set the 
checkpoint to 0.


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479679#comment-17479679
 ] 

sivabalan narayanan commented on HUDI-3242:
---

thanks.

Here is what is happening.

We store the user-configured checkpoint in the checkpoint reset key, so that on 
subsequent sync() rounds in continuous mode we ignore the user-configured 
checkpoint and continue from the last checkpoint (the checkpoint key) in the 
commit metadata. Essentially, if your continuous-mode run makes 10 commits, the 
checkpoint for the 1st sync() is picked from what the user configured, but from 
the 2nd commit onward it is picked up from the last commit's metadata.

But if you ever want to do a backfill with exactly the same checkpoint value 
that you had set when you started a continuous-mode Deltastreamer, it runs into 
this issue.
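In code form, the resolution rule being described is roughly the following. 
This is a minimal sketch under that reading of the behaviour, not the actual 
DeltaSync source; everything except the two metadata key names (which appear in 
the commit files quoted in this thread) is invented for illustration.
{code:java}
import java.util.Map;

// Illustrative sketch of the checkpoint resolution rule described above.
// Not the real Hudi code; class and method names are invented.
public class CheckpointResolutionSketch {

  static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";
  static final String CHECKPOINT_RESET_KEY = "deltastreamer.checkpoint.reset_key";

  /**
   * @param userCheckpoint     value of --checkpoint, or null if not supplied
   * @param lastCommitMetadata extraMetadata of the last completed commit, or null
   * @return the checkpoint the next sync() round would resume from
   */
  static String resolveCheckpoint(String userCheckpoint,
                                  Map<String, String> lastCommitMetadata) {
    if (lastCommitMetadata == null) {
      // First commit on the table: honor the user-supplied checkpoint.
      return userCheckpoint;
    }
    String resetKey = lastCommitMetadata.get(CHECKPOINT_RESET_KEY);
    if (userCheckpoint != null && !userCheckpoint.equals(resetKey)) {
      // A *new* --checkpoint value: honor it (it also becomes the new reset key).
      return userCheckpoint;
    }
    // --checkpoint equals the stored reset key (or was not given): resume from
    // the stored checkpoint key instead. This is why a repeated --checkpoint 0
    // is silently ignored.
    return lastCommitMetadata.get(CHECKPOINT_KEY);
  }

  public static void main(String[] args) {
    // Simulate the reporter's runs: --checkpoint 0 against a table whose last
    // commit already carries reset_key = "0" (values taken from this thread).
    Map<String, String> lastCommit = Map.of(
        CHECKPOINT_RESET_KEY, "0",
        CHECKPOINT_KEY, "1642668697000");
    System.out.println(resolveCheckpoint("0", lastCommit)); // 1642668697000, not 0
  }
}
{code}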

 

Can you start Deltastreamer without setting any checkpoint? Once a couple of 
commits go through, you can stop it and then restart with whatever new 
checkpoint your backfill requires (e.g. 0 in your case).

 


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479650#comment-17479650
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

./20220114204813242.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220115020705625.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220115024513109.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220116191900632.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220116192332086.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120081554816.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120081925787.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120082326203.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120082717393.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120083124273.commit:    "deltastreamer.checkpoint.reset_key" : "0",

./20220120085228007.commit:    "deltastreamer.checkpoint.reset_key" : "0",


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479644#comment-17479644
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

./20220114204813242.commit:    "deltastreamer.checkpoint.key" : "1642193264000"

./20220115020705625.commit:    "deltastreamer.checkpoint.key" : "1642204844000"

./20220115024513109.commit:    "deltastreamer.checkpoint.key" : "1642214688000"

./20220116191900632.commit:    "deltastreamer.checkpoint.key" : "1642291261000"

./20220116192332086.commit:    "deltastreamer.checkpoint.key" : "1642360982000"

./20220120081554816.commit:    "deltastreamer.checkpoint.key" : "1642377603000"

./20220120081925787.commit:    "deltastreamer.checkpoint.key" : "1642464065000"

./20220120082326203.commit:    "deltastreamer.checkpoint.key" : "1642550413000"

./20220120082717393.commit:    "deltastreamer.checkpoint.key" : "1642636804000"

./20220120083124273.commit:    "deltastreamer.checkpoint.key" : "1642667425000"

./20220120085228007.commit:    "deltastreamer.checkpoint.key" : "1642668697000"
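Listings like the above can be reproduced with a small standalone reader, since 
completed commit files are plain JSON with an "extraMetadata" object (as in the 
excerpts later in this thread). A minimal sketch using Jackson; the class name 
is invented, and the table's .hoodie directory is assumed to be passed as an 
argument:
{code:java}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

// Minimal sketch: print the two deltastreamer checkpoint entries from every
// completed commit file in a table's .hoodie directory.
public class DumpCheckpoints {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    File hoodieDir = new File(args[0]); // e.g. /path/to/table/.hoodie
    File[] commits = hoodieDir.listFiles((dir, name) -> name.endsWith(".commit"));
    if (commits == null) {
      throw new IllegalArgumentException("not a directory: " + hoodieDir);
    }
    for (File commit : commits) {
      JsonNode extra = mapper.readTree(commit).path("extraMetadata");
      System.out.printf("%s: key=%s reset_key=%s%n",
          commit.getName(),
          extra.path("deltastreamer.checkpoint.key").asText("-"),
          extra.path("deltastreamer.checkpoint.reset_key").asText("-"));
    }
  }
}
{code}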


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479640#comment-17479640
 ] 

sivabalan narayanan commented on HUDI-3242:
---

Hmmm, got it. I have a suspicion around how we set the checkpoint reset key in 
commit metadata. Can you post the values of these two entries 
(deltastreamer.checkpoint.key and deltastreamer.checkpoint.reset_key) from all 
of your completed commit files, ordered by the files' modification time?

 


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479633#comment-17479633
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

5) The checkpoint from the last commit before this replace commit:

{
  "partitionToWriteStats" : {
    "date=2022/01/19" : [ {
      "fileId" : "4c5a04e9-9288-4000-9909-4c2640c5b779-0",
      "path" : 
"date=2022/01/19/4c5a04e9-9288-4000-9909-4c2640c5b779-0_0-29-893_20220120085228007.parquet",
      "prevCommit" : "20220120083230209",
      "numWrites" : 297106,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 3452,
      "totalWriteBytes" : 12468290,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "date=2022/01/19",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 12468290,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : 
"\{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[REDACTED]}",
    "deltastreamer.checkpoint.reset_key" : "0",
    "deltastreamer.checkpoint.key" : "1642668697000"
  },
  "operationType" : "UPSERT",
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 0,
  "totalUpsertTime" : 6371,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "date=2022/01/19" ],
  "fileIdAndRelativePaths" : {
    "4c5a04e9-9288-4000-9909-4c2640c5b779-0" : 
"date=2022/01/19/4c5a04e9-9288-4000-9909-4c2640c5b779-0_0-29-893_20220120085228007.parquet"
  }
}

!Screen Shot 2022-01-20 at 1.36.48 PM.png!


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479631#comment-17479631
 ] 

Harsha Teja Kanna commented on HUDI-3242:
-

Hi,

Initially I saw the behavior of files from a partition not being picked up, but 
that dataset was produced differently: the source dataset was unloaded from a 
warehouse, and files in different partitions do not have linear timestamps. So 
I thought that might be the cause.

Now I am seeing this in all the tables: when passing checkpoint 0 to reprocess 
a partition, Deltastreamer just skips that partition and only processes the 
current one. So I found that the checkpoint is ignored. I can see it in the 
Deltastreamer logs, and the job ends in 20 seconds.

2) No clustering commits are pending; only one Deltastreamer is running, and it 
completed successfully. I can see the cluster commit on the timeline.

3) I can reproduce it consistently; I am not able to backfill the tables 
currently.

4) Contents of the replace commit:

{
  "partitionToWriteStats" : {
    "date=2022/01/19" : [ {
      "fileId" : "c8b06d0b-1c8a-434e-b54a-15b6525b738a-0",
      "path" : 
"date=2022/01/19/c8b06d0b-1c8a-434e-b54a-15b6525b738a-0_0-78-996_20220120085344674.parquet",
      "prevCommit" : "null",
      "numWrites" : 297106,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 297106,
      "totalWriteBytes" : 12468914,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "date=2022/01/19",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 12468914,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : 
"{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[REDACTED]"
  },
  "operationType" : "CLUSTER",
  "partitionToReplaceFileIds" : {
    "date=2022/01/19" : [ "4c5a04e9-9288-4000-9909-4c2640c5b779-0" ]
  },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 9815,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "date=2022/01/19" ],
  "fileIdAndRelativePaths" : {
    "c8b06d0b-1c8a-434e-b54a-15b6525b738a-0" : 
"date=2022/01/19/c8b06d0b-1c8a-434e-b54a-15b6525b738a-0_0-78-996_20220120085344674.parquet"
  }
}


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread sivabalan narayanan (Jira)


[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479621#comment-17479621 ]

sivabalan narayanan commented on HUDI-3242:
---

For example, this is my test dataset timeline:
{code:java}
ls -ltr /tmp/hudi-deltastreamer-gh-mw/.hoodie                
total 168
drwxr-xr-x  2 nsb  wheel    64 Jan 20 11:51 archived
-rw-r--r--  1 nsb  wheel   488 Jan 20 11:51 hoodie.properties
-rw-r--r--  1 nsb  wheel     0 Jan 20 11:51 20220120115119588.commit.requested
-rw-r--r--  1 nsb  wheel     0 Jan 20 11:51 20220120115119588.inflight
-rw-r--r--  1 nsb  wheel  9715 Jan 20 11:51 20220120115119588.commit
-rw-r--r--  1 nsb  wheel     0 Jan 20 11:52 20220120115212861.commit.requested
-rw-r--r--  1 nsb  wheel     0 Jan 20 11:52 20220120115212861.inflight
-rw-r--r--  1 nsb  wheel  9793 Jan 20 11:52 20220120115212861.commit
-rw-r--r--  1 nsb  wheel     0 Jan 20 11:53 20220120115312770.commit.requested
-rw-r--r--  1 nsb  wheel     0 Jan 20 11:53 20220120115312770.inflight
-rw-r--r--  1 nsb  wheel  9793 Jan 20 11:53 20220120115312770.commit
-rw-r--r--  1 nsb  wheel     0 Jan 20 14:16 20220120141617399.commit.requested
-rw-r--r--  1 nsb  wheel     0 Jan 20 14:16 20220120141617399.inflight
-rw-r--r--  1 nsb  wheel  9901 Jan 20 14:16 20220120141617399.commit
-rw-r--r--  1 nsb  wheel  8748 Jan 20 14:16 20220120141621547.replacecommit.requested
-rw-r--r--  1 nsb  wheel     0 Jan 20 14:16 20220120141621547.replacecommit.inflight
-rw-r--r--  1 nsb  wheel  6818 Jan 20 14:16 20220120141621547.replacecommit
-rw-r--r--  1 nsb  wheel     0 Jan 20 14:17 20220120141739558.commit.requested
-rw-r--r--  1 nsb  wheel     0 Jan 20 14:17 20220120141739558.inflight
-rw-r--r--  1 nsb  wheel  9923 Jan 20 14:17 20220120141739558.commit {code}
And here are the checkpoints:
{code:java}
/tmp/hudi-deltastreamer-gh-mw/.hoodie/20220120115119588.commit:    "deltastreamer.checkpoint.key" : "1642697445000"
/tmp/hudi-deltastreamer-gh-mw/.hoodie/20220120115212861.commit:    "deltastreamer.checkpoint.key" : "1642697505000"
/tmp/hudi-deltastreamer-gh-mw/.hoodie/20220120115312770.commit:    "deltastreamer.checkpoint.key" : "1642697571000"
/tmp/hudi-deltastreamer-gh-mw/.hoodie/20220120141617399.commit:    "deltastreamer.checkpoint.key" : "1642706158000"
/tmp/hudi-deltastreamer-gh-mw/.hoodie/20220120141739558.commit:    "deltastreamer.checkpoint.key" : "1642706252000"
{code}


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread sivabalan narayanan (Jira)


[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479618#comment-17479618 ]

sivabalan narayanan commented on HUDI-3242:
---

Btw, would you mind posting the contents of 20220120085344674__replacecommit__COMPLETED? Our expectation is that it should not contain any metadata for CHECKPOINT_KEY or CHECKPOINT_RESET_KEY.


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread sivabalan narayanan (Jira)


[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479616#comment-17479616 ]

sivabalan narayanan commented on HUDI-3242:
---

Hey [~h7kanna]: let me try to understand the problem.

I see you have reported 2 issues:
 # Some files are missed while querying (from the 2nd partition in your example).
 # Backfilling is not working: the checkpoint you set is not being picked up.

Let's go one by one. 

(2): I am wondering how the checkpoint in your last commit metadata could be 0. If you made some commits, the checkpoint should have progressed to some non-zero value in my understanding; we need to identify the root cause of this. But if the checkpoint in your latest commit metadata is non-zero and you set the checkpoint value to 0 when you restart your Deltastreamer, it should be picked up.

(1): Can you reproduce this consistently? I see you have clustering enabled. We had a [bug|https://github.com/apache/hudi/pull/4172] some time back where, if there was a pending clustering operation, some files were missed while querying. Can you wait until clustering completes and see whether you can then see all the data?

Btw, how did you confirm that some files are missed being read/discovered? Did you check the SQL tab in the Spark UI, or is it based on the query results returned for your read query?


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-20 Thread Harsha Teja Kanna (Jira)


[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17479240#comment-17479240 ]

Harsha Teja Kanna commented on HUDI-3242:
-

What I found from further debugging is that once --checkpoint 0 has been passed to Deltastreamer once, it will not be picked up again if the value is the same.

[https://github.com/apache/hudi/blob/release-0.10.1/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L471]
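Roughly, the selection at that line amounts to the following paraphrase (a simplified sketch of the DeltaSync behavior in Hudi 0.10.x, not the verbatim source; the class, method, and parameter names are illustrative):

{code:java}
// Minimal sketch of the checkpoint selection in DeltaSync (Hudi 0.10.x),
// paraphrased from the linked code; not the verbatim source.
public class CheckpointResolution {

  // resetKeyFromMeta: "deltastreamer.checkpoint.reset_key" in the last commit
  // ckpKeyFromMeta:   "deltastreamer.checkpoint.key" resumed from the timeline
  // ckpFromConfig:    the value passed via --checkpoint
  static String resolveCheckpoint(String resetKeyFromMeta, String ckpKeyFromMeta,
                                  String ckpFromConfig) {
    // --checkpoint is honored only when it differs from the reset key already
    // recorded in commit metadata; a repeated "--checkpoint 0" is skipped and
    // the streamer falls back to the previously stored checkpoint.
    if (ckpFromConfig != null && !ckpFromConfig.equals(resetKeyFromMeta)) {
      return ckpFromConfig;
    }
    return ckpKeyFromMeta;
  }

  public static void main(String[] args) {
    // Values from the log below: reset key "0" in metadata, --checkpoint 0,
    // stored checkpoint 1642668697000. Prints 1642668697000, i.e. the
    // explicit --checkpoint 0 is effectively ignored.
    System.out.println(resolveCheckpoint("0", "1642668697000", "0"));
  }
}
{code}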

I added log statements in a PR against the master branch.

This is what I got:

{code:java}
22/01/20 04:04:28 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/v1/journals
22/01/20 04:04:28 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties
22/01/20 04:04:28 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/v1/journals
22/01/20 04:04:29 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220120085344674__replacecommit__COMPLETED]}
22/01/20 04:04:29 INFO DFSPathSelector: Using path selector org.apache.hudi.utilities.sources.helpers.DFSPathSelector
22/01/20 04:04:29 INFO HoodieDeltaStreamer: Delta Streamer running only single round
22/01/20 04:04:29 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/v1/journals
22/01/20 04:04:29 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/v1/journals/.hoodie/hoodie.properties
22/01/20 04:04:29 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/v1/journals
22/01/20 04:04:30 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220120085344674__replacecommit__COMPLETED]}
22/01/20 04:04:30 INFO DeltaSync: Checkpoint reset from metadata: 0
22/01/20 04:04:30 INFO DeltaSync: Checkpoint from config: 0
22/01/20 04:04:30 INFO DeltaSync: Checkpoint to resume from : Option{val=1642668697000}
22/01/20 04:04:30 INFO DFSPathSelector: Root path => s3a://datalake-hudi/v1/journals/year=2022/month=01/day=19 source limit => 9223372036854775807
22/01/20 04:04:37 INFO DeltaSync: No new data, source checkpoint has not changed. Nothing to commit. Old checkpoint=(Option{val=1642668697000}). New Checkpoint=(1642668697000)
22/01/20 04:04:37 INFO DeltaSync: Shutting down embedded timeline server
22/01/20 04:04:37 INFO HoodieDeltaStreamer: Shut down delta streamer
22/01/20 04:04:37 INFO SparkUI: Stopped Spark web UI at http://192.168.86.5:4040
22/01/20 04:04:37 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/01/20 04:04:37 INFO MemoryStore: MemoryStore cleared
22/01/20 04:04:37 INFO BlockManager: BlockManager stopped
22/01/20 04:04:38 INFO BlockManagerMaster: BlockManagerMaster stopped
22/01/20 04:04:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/01/20 04:04:38 INFO SparkContext: Successfully stopped SparkContext
22/01/20 04:04:38 INFO ShutdownHookManager: Shutdown hook called
22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-acf0e21c-c48c-440c-86f8-72ff20bef349
22/01/20 04:04:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-b53eb674-0c67-4b68-8974-7ff706408686
22/01/20 04:04:38 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
22/01/20 04:04:38 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
{code}


[jira] [Commented] (HUDI-3242) Checkpoint 0 is ignored -Partial parquet file discovery after the first commit

2022-01-16 Thread Harsha Teja Kanna (Jira)


[ https://issues.apache.org/jira/browse/HUDI-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476923#comment-17476923 ]

Harsha Teja Kanna commented on HUDI-3242:
-

I am now seeing this for all the tables. After running a few commits on a partition, I had to backfill, and the checkpoint 0 is ignored.
