[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2024-06-10 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4123:

Fix Version/s: 0.16.0

> HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
> 
>
> Key: HUDI-4123
> URL: https://issues.apache.org/jira/browse/HUDI-4123
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0, 0.16.0
>
>
> When using SqlSource:
> ## Create a Hive source table
> ```sql
> create database test location '/test';
> create table test.test_source (
>   id int,
>   name string,
>   price double,
>   dt string,
>   ts bigint
> );
> insert into test.test_source values (105,'hudi', 10.0,'2021-05-05',100);
> ```
> ## Use SqlSource
> sql_source.properties
> ```
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.partitionpath.field=dt
> hoodie.deltastreamer.source.sql.sql.query=select * from test.test_source
> hoodie.datasource.hive_sync.table=test_hudi_target
> hoodie.datasource.hive_sync.database=hudi
> hoodie.datasource.hive_sync.partition_fields=dt
> hoodie.datasource.hive_sync.create_managed_table=true
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> ```
> ```bash
> spark-submit --conf "spark.sql.catalogImplementation=hive" \
>   --master yarn --deploy-mode client \
>   --executor-memory 2G --num-executors 3 --executor-cores 2 \
>   --driver-memory 4G --driver-cores 2 \
>   --principal spark/indata-10-110-105-163.indata@indata.com \
>   --keytab /etc/security/keytabs/spark.service.keytab \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>   /usr/hdp/3.1.0.0-78/spark2/jars/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar \
>   --props file:///opt/sql_source.properties \
>   --target-base-path /hudi/test_hudi_target \
>   --target-table test_hudi_target \
>   --op BULK_INSERT \
>   --table-type COPY_ON_WRITE \
>   --source-ordering-field ts \
>   --source-class org.apache.hudi.utilities.sources.SqlSource \
>   --enable-sync \
>   --checkpoint earliest \
>   --allow-commit-on-no-checkpoint-change
> ```
> The first run writes the Hive source table to the Hudi target table
> successfully. However, running the job again (e.g., a second time) throws an
> exception:
> ```
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to 
> find previous checkpoint. Please double check if this table was indeed built 
> via delta streamer. Last Commit :
> "deltastreamer.checkpoint.reset_key" : "earliest"
> ```
> The reason is that `deltastreamer.checkpoint.reset_key` is saved as
> `earliest` while `deltastreamer.checkpoint.key` is null; by the logic of the
> method `getCheckpointToResume`, this combination throws the exception.
> I think that since the checkpoint returned by SqlSource is null, the value of
> `deltastreamer.checkpoint.reset_key` should also be saved as null. By the
> same logic in `getCheckpointToResume`, that would avoid this exception.
>
> ```
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to find previous checkpoint. Please double check if this table was indeed built via delta streamer. Last Commit :Option{val=[20220519162403646__commit__COMPLETED]}, Instants :[[20220519162403646__commit__COMPLETED]], CommitMetadata={
>   "partitionToWriteStats" : {
>     "2016/03/15" : [ {
>       "fileId" : "6a1e0512-508a-4bdb-ad8f-200cda157ff0-0",
>       "path" : "2016/03/15/6a1e0512-508a-4bdb-ad8f-200cda157ff0-0_0-21-21_20220519162403646.parquet",
>       "prevCommit" : "null",
>       "numWrites" : 342,
>       "numDeletes" : 0,
>       "numUpdateWrites" : 0,
>       "numInserts" : 342,
>       "totalWriteBytes" : 481336,
>       "totalWriteErrors" : 0,
>       "tempPath" : null,
>       "partitionPath" : "2016/03/15",
>       "totalLogRecords" : 0,
>       "totalLogFilesCompacted" : 0,
>       "totalLogSizeCompacted" : 0,
>       "totalUpdatedRecordsCompacted" : 0,
>       "totalLogBlocks" : 0,
>       "totalCorruptLogBlock" : 0,
>       "totalRollbackBlocks" : 0,
>       "fileSizeInBytes" : 481336,
>       "minEventTime" : null,
>       "maxEventTime" : null
>     } ],
>     "2015/03/16" : [ {
>       "fileId" : "f3371308-8809-4644-baf6-c65c3fb86c8e-0",
>       "path" : "2015/03/16/f3371308-8809-4644-baf6-c65c3fb86c8e-0_1-21-22_20220519162403646.parquet",
>       "prevCommit" : "null",
>       "numWrites" : 340,
> ```
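The reset-key/checkpoint-key interplay described in the report can be sketched as follows. This is a hypothetical simplification for illustration only, not Hudi's actual `DeltaSync#getCheckpointToResume` implementation: the function body, control flow, and exception message are assumptions reconstructed from the behavior quoted above; only the two metadata key names come from the report.

```python
# Hypothetical sketch of the checkpoint-resume decision described in this
# issue. NOT Hudi's real code: a simplification of the reported behavior
# of getCheckpointToResume.

class HoodieDeltaStreamerException(Exception):
    pass

CHECKPOINT_KEY = "deltastreamer.checkpoint.key"
RESET_KEY = "deltastreamer.checkpoint.reset_key"

def get_checkpoint_to_resume(last_commit_metadata):
    """Decide which checkpoint to resume from, based on the extra
    metadata of the last completed commit."""
    checkpoint = last_commit_metadata.get(CHECKPOINT_KEY)
    reset_key = last_commit_metadata.get(RESET_KEY)
    if checkpoint is not None:
        return checkpoint   # normal path: resume from the stored checkpoint
    if reset_key is not None:
        # reset_key was persisted ("earliest") but SqlSource returned a
        # null checkpoint, so no checkpoint key was stored -> the
        # exception reported on the second run.
        raise HoodieDeltaStreamerException(
            "Unable to find previous checkpoint. Please double check if "
            "this table was indeed built via delta streamer.")
    return None             # neither key present: treat as a fresh table

# First run with SqlSource + --checkpoint earliest persists only the reset key:
after_first_run = {RESET_KEY: "earliest"}
# Proposed fix: also persist reset_key as null when the checkpoint is null,
# so the second run falls through to the "fresh table" branch instead:
after_proposed_fix = {}
```

Under this sketch, `after_first_run` reproduces the quoted exception, while `after_proposed_fix` returns `None` and the second run proceeds.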

[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2024-06-10 Thread Ethan Guo (Jira)



Ethan Guo updated HUDI-4123:

Fix Version/s: (was: 0.15.0)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2024-03-28 Thread Ethan Guo (Jira)



Ethan Guo updated HUDI-4123:

Fix Version/s: 0.15.0


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2023-10-04 Thread Prashant Wason (Jira)



Prashant Wason updated HUDI-4123:
Fix Version/s: 0.14.1
   (was: 0.14.0)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2023-02-07 Thread Ethan Guo (Jira)



Ethan Guo updated HUDI-4123:

Fix Version/s: 0.14.0
   (was: 0.13.0)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2023-01-05 Thread Danny Chen (Jira)



Danny Chen updated HUDI-4123:
Sprint:   (was: )


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-12-20 Thread Raymond Xu (Jira)



Raymond Xu updated HUDI-4123:
Sprint: 2023-01-09  (was: 0.13.0 Final Sprint)

> HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
> 
>
> Key: HUDI-4123
> URL: https://issues.apache.org/jira/browse/HUDI-4123
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When use SqlSource:
> ## Create hive source table
> ```sql
> create database test location '/test';
> create table test.test_source (
>   id int,
>   name string,
>   price double,
>   dt string,
>   ts bigint
> );
> insert into test.test_source values (105,'hudi', 10.0,'2021-05-05',100);
> ```
> ## Use SqlSource
> sql_source.properties
> ```
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.partitionpath.field=dt
> hoodie.deltastreamer.source.sql.sql.query = select * from test.test_source
> hoodie.datasource.hive_sync.table=test_hudi_target
> hoodie.datasource.hive_sync.database=hudi
> hoodie.datasource.hive_sync.partition_fields=dt
> hoodie.datasource.hive_sync.create_managed_table = true
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> ```
> ```bash
> spark-submit --conf "spark.sql.catalogImplementation=hive" \
> --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 
> --executor-cores 2 --driver-memory 4G --driver-cores 2 \
> --principal spark/indata-10-110-105-163.indata@indata.com --keytab 
> /etc/security/keytabs/spark.service.keytab \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> /usr/hdp/3.1.0.0-78/spark2/jars/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar
>  \
> --props file:///opt/sql_source.properties  \
> --target-base-path /hudi/test_hudi_target \
> --target-table test_hudi_target \
> --op BULK_INSERT \
> --table-type COPY_ON_WRITE \
> --source-ordering-field ts \
> --source-class org.apache.hudi.utilities.sources.SqlSource \
> --enable-sync  \
> --checkpoint earliest \
> --allow-commit-on-no-checkpoint-change
> ```
> On the first run, the Hive source table is successfully written to the Hudi
> target table. However, if the job is run again (e.g., a second time), an
> exception is thrown:
> ```
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to 
> find previous checkpoint. Please double check if this table was indeed built 
> via delta streamer. Last Commit :
> "deltastreamer.checkpoint.reset_key" : "earliest"
> ```
> The reason is that `deltastreamer.checkpoint.reset_key` is saved as
> `earliest`, but `deltastreamer.checkpoint.key` is absent (null). According to
> the logic of the method `getCheckpointToResume`, this combination throws the
> exception above.
> I think that since the checkpoint returned by SqlSource is null, the value of
> `deltastreamer.checkpoint.key` should also be saved as null. That would avoid
> this exception, given the logic of the method `getCheckpointToResume`.
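The mismatch can be modeled with a short sketch (hypothetical Python, not Hudi's actual Java code; the dict stands in for the commit's extra metadata, and `get_checkpoint_to_resume` stands in for the method's decision as described above):

```python
CHECKPOINT_KEY = "deltastreamer.checkpoint.key"
RESET_KEY = "deltastreamer.checkpoint.reset_key"

_ABSENT = object()  # sentinel: key was never written to the commit metadata

def get_checkpoint_to_resume(commit_metadata):
    """Model of the resume decision based on the previous commit's metadata."""
    reset = commit_metadata.get(RESET_KEY, _ABSENT)
    ckpt = commit_metadata.get(CHECKPOINT_KEY, _ABSENT)
    if reset is not _ABSENT and ckpt is _ABSENT:
        # Reported behavior: reset_key was saved ("earliest"), but
        # checkpoint.key was never written because SqlSource returned a
        # null checkpoint, so the second run fails to resume.
        raise RuntimeError("Unable to find previous checkpoint.")
    # Proposed behavior: if checkpoint.key is persisted explicitly as
    # null (None), the run proceeds with no checkpoint instead of failing.
    return None if ckpt is _ABSENT else ckpt

# First run with SqlSource currently writes only the reset key:
current = {RESET_KEY: "earliest"}                               # raises
# Proposed fix also persists checkpoint.key as null:
proposed = {RESET_KEY: "earliest", CHECKPOINT_KEY: None}        # resumes with no checkpoint
```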
>  
>  
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to find previous checkpoint. Please double check if this table was indeed built via delta streamer. Last Commit :Option{val=[20220519162403646__commit__COMPLETED]}, Instants :[[20220519162403646__commit__COMPLETED]], CommitMetadata={
>   "partitionToWriteStats" : {
>     "2016/03/15" : [ {
>       "fileId" : "6a1e0512-508a-4bdb-ad8f-200cda157ff0-0",
>       "path" : "2016/03/15/6a1e0512-508a-4bdb-ad8f-200cda157ff0-0_0-21-21_20220519162403646.parquet",
>       "prevCommit" : "null",
>       "numWrites" : 342,
>       "numDeletes" : 0,
>       "numUpdateWrites" : 0,
>       "numInserts" : 342,
>       "totalWriteBytes" : 481336,
>       "totalWriteErrors" : 0,
>       "tempPath" : null,
>       "partitionPath" : "2016/03/15",
>       "totalLogRecords" : 0,
>       "totalLogFilesCompacted" : 0,
>       "totalLogSizeCompacted" : 0,
>       "totalUpdatedRecordsCompacted" : 0,
>       "totalLogBlocks" : 0,
>       "totalCorruptLogBlock" : 0,
>       "totalRollbackBlocks" : 0,
>       "fileSizeInBytes" : 481336,
>       "minEventTime" : null,
>       "maxEventTime" : null
>     } ],
>     "2015/03/16" : [ {
>       "fileId" : "f3371308-8809-4644-baf6-c65c3fb86c8e-0",
>       "path" : "2015/03/16/f3371308-8809-4644-baf6-c65c3fb86c8e-0_1-21-22_20220519162403646.parquet",
>       "prevCommit" : "null",
>       "numWrites" :

[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-12-19 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-4123:
--
Sprint: 2022/12/26


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-10-01 Thread Zhaojing Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaojing Yu updated HUDI-4123:
--
Fix Version/s: 0.13.0
   (was: 0.12.1)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-26 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4123:

Sprint:   (was: 2022/09/19)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4123:
--
Status: In Progress  (was: Open)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4123:
--
Sprint: 2022/09/19  (was: 2022/09/05)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4123:
--
Status: Open  (was: Patch Available)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4123:
--
Sprint: 2022/09/05  (was: 2022/09/19)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4123:
--
Sprint: 2022/09/19  (was: 2022/09/05)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4123:
--
Sprint: 2022/09/05  (was: 2022/09/19)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4123:
--
Sprint: 2022/09/19  (was: 2022/09/05)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-09-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4123:
--
Reviewers: sivabalan narayanan


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-08-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4123:
-
Sprint: 2022/09/05  (was: 2022/08/22)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-08-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4123:
-
Story Points: 1

> HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
> 
>
> Key: HUDI-4123
> URL: https://issues.apache.org/jira/browse/HUDI-4123
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When use SqlSource:
> ## Create hive source table
> ```sql
> create database test location '/test';
> create table test.test_source (
>   id int,
>   name string,
>   price double,
>   dt string,
>   ts bigint
> );
> insert into test.test_source values (105,'hudi', 10.0,'2021-05-05',100);
> ```
> ## Use SqlSource
> sql_source.properties
> ```
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.partitionpath.field=dt
> hoodie.deltastreamer.source.sql.sql.query = select * from test.test_source
> hoodie.datasource.hive_sync.table=test_hudi_target
> hoodie.datasource.hive_sync.database=hudi
> hoodie.datasource.hive_sync.partition_fields=dt
> hoodie.datasource.hive_sync.create_managed_table = true
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> ```
> ```bash
> spark-submit --conf "spark.sql.catalogImplementation=hive" \
> --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 
> --executor-cores 2 --driver-memory 4G --driver-cores 2 \
> --principal spark/indata-10-110-105-163.indata@indata.com --keytab 
> /etc/security/keytabs/spark.service.keytab \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
> /usr/hdp/3.1.0.0-78/spark2/jars/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar
>  \
> --props file:///opt/sql_source.properties  \
> --target-base-path /hudi/test_hudi_target \
> --target-table test_hudi_target \
> --op BULK_INSERT \
> --table-type COPY_ON_WRITE \
> --source-ordering-field ts \
> --source-class org.apache.hudi.utilities.sources.SqlSource \
> --enable-sync  \
> --checkpoint earliest \
> --allow-commit-on-no-checkpoint-change
> ```
> Once executed, the hive source table can be successfully written to the Hudi 
> target table.
> However, if it is executed multiple times, such as the second time, an 
> exception will be thrown:
> ```
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to 
> find previous checkpoint. Please double check if this table was indeed built 
> via delta streamer. Last Commit :
> "deltastreamer.checkpoint.reset_key" : "earliest"
> ```
> The reason is that the value of `deltastreamer.checkpoint.reset_key` is
> `earliest`, but `deltastreamer.checkpoint.key` is null. According to the
> logic of the method `getCheckpointToResume`, this combination throws the
> exception above.
> I think that since SqlSource returns a null checkpoint, the value of
> `deltastreamer.checkpoint.key` should also be saved as null. Given the
> logic of `getCheckpointToResume`, this would also avoid the exception.
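>  
> To illustrate, a simplified sketch of the failing branch (not the actual
> Hudi source; the map access and variable names below are approximations of
> the logic in `getCheckpointToResume`):
> ```java
> // Sketch of the resume decision, assuming the last commit's extra metadata
> // has been loaded into a Map<String, String>; names are illustrative only.
> String resetKey = commitMetadata.get("deltastreamer.checkpoint.reset_key"); // "earliest"
> String checkpointKey = commitMetadata.get("deltastreamer.checkpoint.key");  // missing for SqlSource
> if (checkpointKey == null) {
>   // reset_key was written but checkpoint.key never was, so the resume logic
>   // cannot tell this table apart from one not built by DeltaStreamer.
>   throw new HoodieDeltaStreamerException(
>       "Unable to find previous checkpoint. Please double check if this table "
>       + "was indeed built via delta streamer. Last Commit :" + resetKey);
> }
> ```
> Writing `deltastreamer.checkpoint.key` (even with a null/empty value)
> alongside the reset key would keep this branch from firing.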
>  
> The full exception, including the last commit metadata:
> ```
> org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to
> find previous checkpoint. Please double check if this table was indeed built
> via delta streamer.
> Last Commit :Option{val=[20220519162403646__commit__COMPLETED]},
> Instants :[[20220519162403646__commit__COMPLETED]],
> CommitMetadata={
>   "partitionToWriteStats" : {
>     "2016/03/15" : [ {
>       "fileId" : "6a1e0512-508a-4bdb-ad8f-200cda157ff0-0",
>       "path" : "2016/03/15/6a1e0512-508a-4bdb-ad8f-200cda157ff0-0_0-21-21_20220519162403646.parquet",
>       "prevCommit" : "null",
>       "numWrites" : 342,
>       "numDeletes" : 0,
>       "numUpdateWrites" : 0,
>       "numInserts" : 342,
>       "totalWriteBytes" : 481336,
>       "totalWriteErrors" : 0,
>       "tempPath" : null,
>       "partitionPath" : "2016/03/15",
>       "totalLogRecords" : 0,
>       "totalLogFilesCompacted" : 0,
>       "totalLogSizeCompacted" : 0,
>       "totalUpdatedRecordsCompacted" : 0,
>       "totalLogBlocks" : 0,
>       "totalCorruptLogBlock" : 0,
>       "totalRollbackBlocks" : 0,
>       "fileSizeInBytes" : 481336,
>       "minEventTime" : null,
>       "maxEventTime" : null
>     } ],
>     "2015/03/16" : [ {
>       "fileId" : "f3371308-8809-4644-baf6-c65c3fb86c8e-0",
>       "path" : "2015/03/16/f3371308-8809-4644-baf6-c65c3fb86c8e-0_1-21-22_20220519162403646.parquet",
>       "prevCommit" : "null",
>       "numWrites" : 340,
>       "numDeletes" : 0,
> ```

[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-08-18 Thread sivabalan narayanan (Jira)



sivabalan narayanan updated HUDI-4123:
--
Priority: Critical  (was: Major)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-08-17 Thread Sagar Sumit (Jira)



Sagar Sumit updated HUDI-4123:
--
Sprint: 2022/08/22


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-08-17 Thread Sagar Sumit (Jira)



Sagar Sumit updated HUDI-4123:
--
Status: Patch Available  (was: In Progress)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-08-17 Thread Sagar Sumit (Jira)



Sagar Sumit updated HUDI-4123:
--
Status: In Progress  (was: Open)


[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint

2022-08-17 Thread Sagar Sumit (Jira)



Sagar Sumit updated HUDI-4123:
--
Summary: HoodieDeltaStreamer throws exception due to SqlSource return null 
checkpoint  (was: HoodieDeltaStreamer trhows exception due to SqlSource return 
null checkpoint)
