[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4123: Fix Version/s: 0.16.0

Key: HUDI-4123
URL: https://issues.apache.org/jira/browse/HUDI-4123
Project: Apache Hudi
Issue Type: Bug
Components: deltastreamer
Reporter: 董可伦
Assignee: 董可伦
Priority: Critical
Labels: pull-request-available
Fix For: 0.15.0, 0.16.0

When using SqlSource:

## Create the Hive source table

```sql
create database test location '/test';

create table test.test_source (
  id int,
  name string,
  price double,
  dt string,
  ts bigint
);

insert into test.test_source values (105, 'hudi', 10.0, '2021-05-05', 100);
```

## Use SqlSource

sql_source.properties:

```
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=dt
hoodie.deltastreamer.source.sql.sql.query=select * from test.test_source
hoodie.datasource.hive_sync.table=test_hudi_target
hoodie.datasource.hive_sync.database=hudi
hoodie.datasource.hive_sync.partition_fields=dt
hoodie.datasource.hive_sync.create_managed_table=true
hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
```

```bash
spark-submit --conf "spark.sql.catalogImplementation=hive" \
  --master yarn --deploy-mode client \
  --executor-memory 2G --num-executors 3 --executor-cores 2 \
  --driver-memory 4G --driver-cores 2 \
  --principal spark/indata-10-110-105-163.indata@indata.com \
  --keytab /etc/security/keytabs/spark.service.keytab \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  /usr/hdp/3.1.0.0-78/spark2/jars/hudi-utilities-bundle_2.11-0.12.0-SNAPSHOT.jar \
  --props file:///opt/sql_source.properties \
  --target-base-path /hudi/test_hudi_target \
  --target-table test_hudi_target \
  --op BULK_INSERT \
  --table-type COPY_ON_WRITE \
  --source-ordering-field ts \
  --source-class org.apache.hudi.utilities.sources.SqlSource \
  --enable-sync \
  --checkpoint earliest \
  --allow-commit-on-no-checkpoint-change
```

The first run writes the Hive source table into the Hudi target table successfully. However, running the job again (for example, a second time) throws an exception:

```
org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to find previous checkpoint. Please double check if this table was indeed built via delta streamer. Last Commit :
"deltastreamer.checkpoint.reset_key" : "earliest"
```

The reason is that in the last commit's metadata the value of `deltastreamer.checkpoint.reset_key` is `earliest` while `deltastreamer.checkpoint.key` is null; under the logic of the method `getCheckpointToResume`, this combination throws the exception. I think that since SqlSource returns a null checkpoint, the value of `deltastreamer.checkpoint.key` should also be saved as null; by the same logic in `getCheckpointToResume`, that would avoid this exception.

Full exception with the last commit's metadata (truncated):

```
org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to find previous checkpoint. Please double check if this table was indeed built via delta streamer.
Last Commit :Option{val=[20220519162403646__commit__COMPLETED]}, Instants :[[20220519162403646__commit__COMPLETED]], CommitMetadata={
  "partitionToWriteStats" : {
    "2016/03/15" : [ {
      "fileId" : "6a1e0512-508a-4bdb-ad8f-200cda157ff0-0",
      "path" : "2016/03/15/6a1e0512-508a-4bdb-ad8f-200cda157ff0-0_0-21-21_20220519162403646.parquet",
      "prevCommit" : "null",
      "numWrites" : 342,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 342,
      "totalWriteBytes" : 481336,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "2016/03/15",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 481336,
      "minEventTime" : null,
      "maxEventTime" : null
    } ],
    "2015/03/16" : [ {
      "fileId" : "f3371308-8809-4644-baf6-c65c3fb86c8e-0",
      "path" : "2015/03/16/f3371308-8809-4644-baf6-c65c3fb86c8e-0_1-21-22_20220519162403646.parquet",
      "prevCommit" : "null",
      "numWrites" : 340,
```
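The checkpoint-resolution behavior described above can be sketched in a few lines. This is a minimal illustration of the proposed fix, not Hudi's actual code: the class, method, and constant names below are simplified stand-ins, and the real `getCheckpointToResume` lives in the DeltaStreamer sync path with a richer signature.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the checkpoint-resume decision. The failing case:
// reset_key is present ("earliest") but checkpoint.key was never written,
// because SqlSource returned a null checkpoint. The proposed fix treats a
// missing checkpoint.key as "nothing to resume from" instead of throwing.
public class CheckpointResumeSketch {
  static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";
  static final String CHECKPOINT_RESET_KEY = "deltastreamer.checkpoint.reset_key";

  static Optional<String> resolveCheckpoint(Map<String, String> lastCommitExtraMetadata) {
    String checkpoint = lastCommitExtraMetadata.get(CHECKPOINT_KEY);
    String resetKey = lastCommitExtraMetadata.get(CHECKPOINT_RESET_KEY);
    if (checkpoint != null && !checkpoint.isEmpty()) {
      return Optional.of(checkpoint); // normal resume path
    }
    if (resetKey != null) {
      // Before the fix, this branch threw HoodieDeltaStreamerException
      // ("Unable to find previous checkpoint..."). After saving
      // checkpoint.key as null alongside reset_key, an empty Optional
      // is returned and the next run starts fresh.
      return Optional.empty();
    }
    return Optional.empty();
  }
}
```

With metadata like the commit above ({"deltastreamer.checkpoint.reset_key": "earliest"} and no checkpoint key), `resolveCheckpoint` returns an empty Optional rather than raising, which is exactly the second-run scenario in this report.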
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4123: Fix Version/s: (was: 0.15.0)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4123: Fix Version/s: 0.15.0
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-4123: Fix Version/s: 0.14.1 (was: 0.14.0)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4123: Fix Version/s: 0.14.0 (was: 0.13.0)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-4123: Sprint: (was: )
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4123: Sprint: 2023-01-09 (was: 0.13.0 Final Sprint)
Sagar Sumit updated HUDI-4123: Sprint: 2022/12/26
Zhaojing Yu updated HUDI-4123: Fix Version/s: 0.13.0 (was: 0.12.1)
Ethan Guo updated HUDI-4123: Sprint: (was: 2022/09/19)
sivabalan narayanan updated HUDI-4123: Status: In Progress (was: Open)
sivabalan narayanan updated HUDI-4123: Sprint: 2022/09/19 (was: 2022/09/05)
sivabalan narayanan updated HUDI-4123: Status: Open (was: Patch Available)
sivabalan narayanan updated HUDI-4123: Sprint: 2022/09/05 (was: 2022/09/19)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4123: Sprint: 2022/09/19 (was: 2022/09/05)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4123: Sprint: 2022/09/05 (was: 2022/09/19)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4123: Sprint: 2022/09/19 (was: 2022/09/05)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4123: Reviewers: sivabalan narayanan
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4123: Sprint: 2022/09/05 (was: 2022/08/22)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4123: Story Points: 1
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4123: Priority: Critical (was: Major)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-4123: -- Sprint: 2022/08/22 > Priority: Major > Fix For: 0.12.1
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-4123: -- Status: Patch Available (was: In Progress)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-4123: -- Status: In Progress (was: Open)
[jira] [Updated] (HUDI-4123) HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint
[ https://issues.apache.org/jira/browse/HUDI-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-4123: -- Summary: HoodieDeltaStreamer throws exception due to SqlSource return null checkpoint (was: HoodieDeltaStreamer trhows exception due to SqlSource return null checkpoint)