hudi-bot opened a new issue, #17309:
URL: https://github.com/apache/hudi/issues/17309
Although reads and writes succeed without errors, Spark SQL UPDATE and DELETE do not
write record positions to the log files.
{code:java}
spark-sql (default)> CREATE TABLE testing_positions.table2 (
> ts BIGINT,
> uuid STRING,
> rider STRING,
> driver STRING,
> fare DOUBLE,
> city STRING
> ) USING HUDI
> LOCATION 'file:///Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2'
> TBLPROPERTIES (
> type = 'mor',
> primaryKey = 'uuid',
> preCombineField = 'ts'
> )
> PARTITIONED BY (city);
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file
written for commit, so could not get schema for table
file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
Time taken: 0.4 seconds
spark-sql (default)> INSERT INTO testing_positions.table2
> VALUES
> (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
> (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'san_francisco'),
> (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90,'san_francisco'),
> (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
> (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'),
> (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40,'sao_paulo'),
> (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06,'chennai'),
> (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file
written for commit, so could not get schema for table
file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
24/11/16 12:03:26 WARN TableSchemaResolver: Could not find any data file
written for commit, so could not get schema for table
file:/Users/ethan/Work/tmp/hudi-1.0.0-testing/positional/table2
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updated size to 436166
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updating partition stats fast for: table2_ro
24/11/16 12:03:29 WARN log: Updated size to 436185
24/11/16 12:03:29 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updated size to 436166
24/11/16 12:03:30 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2_rt
24/11/16 12:03:30 WARN log: Updated size to 436185
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436166
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436386
24/11/16 12:03:30 WARN log: Updating partition stats fast for: table2
24/11/16 12:03:30 WARN log: Updated size to 436185
24/11/16 12:03:30 WARN HiveConf: HiveConf of name
hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:30 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout
does not exist
24/11/16 12:03:30 WARN HiveConf: HiveConf of name hive.stats.retries.wait
does not exist
Time taken: 4.843 seconds
spark-sql (default)>
> SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit 0
Time taken: 0.018 seconds, Fetched 1 row(s)
spark-sql (default)>
> UPDATE testing_positions.table2 SET fare = 20.0 WHERE
rider = 'rider-A';
24/11/16 12:03:31 WARN SparkStringUtils: Truncated the string representation
of a plan since it was too large. This behavior can be adjusted by setting
'spark.sql.debug.maxToStringFields'.
24/11/16 12:03:32 WARN HoodieFileIndex: Data skipping requires both Metadata
Table and at least one of Column Stats Index, Record Level Index, or Functional
Index to be enabled as well! (isMetadataTableEnabled = false,
isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false,
isFunctionalIndexEnabled = false, isBucketIndexEnable = false,
isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/16 12:03:32 WARN HoodieDataBlock: There are records without valid
positions. Skip writing record positions to the data block header.
24/11/16 12:03:34 WARN HiveConf: HiveConf of name
hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:34 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout
does not exist
24/11/16 12:03:34 WARN HiveConf: HiveConf of name hive.stats.retries.wait
does not exist
Time taken: 5.545 seconds
spark-sql (default)>
> DELETE FROM testing_positions.table2 WHERE uuid =
'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/16 12:03:37 WARN HoodieFileIndex: Data skipping requires both Metadata
Table and at least one of Column Stats Index, Record Level Index, or Functional
Index to be enabled as well! (isMetadataTableEnabled = false,
isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false,
isFunctionalIndexEnabled = false, isBucketIndexEnable = false,
isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/16 12:03:37 WARN HoodiePositionBasedFileGroupRecordBuffer: No record
position info is found when attempt to do position based merge.
24/11/16 12:03:37 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling
back to key based merge for Read
24/11/16 12:03:38 WARN HoodieDeleteBlock: There are delete records without
valid positions. Skip writing record positions to the delete block header.
24/11/16 12:03:39 WARN HiveConf: HiveConf of name
hive.internal.ss.authz.settings.applied.marker does not exist
24/11/16 12:03:39 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout
does not exist
24/11/16 12:03:39 WARN HiveConf: HiveConf of name hive.stats.retries.wait
does not exist
Time taken: 2.992 seconds
spark-sql (default)>
> select * from testing_positions.table2;
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: No record
position info is found when attempt to do position based merge.
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: No record
position info is found when attempt to do position based merge.
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling
back to key based merge for Read
24/11/16 12:03:41 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling
back to key based merge for Read
20241116120326527 20241116120326527_0_0
1dced545-862b-4ceb-8b43-d2a568f6616b city=san_francisco
1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet
1695332066204 1dced545-862b-4ceb-8b43-d2a568f6616b rider-E driver-O
93.5 san_francisco
20241116120326527 20241116120326527_0_1
e96c4396-3fad-413a-a942-4cb36106d721 city=san_francisco
1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet
1695091554788 e96c4396-3fad-413a-a942-4cb36106d721 rider-C driver-M
27.7 san_francisco
20241116120326527 20241116120326527_0_2
9909a8b1-2d15-4d3d-8ec9-efc48c536a00 city=san_francisco
1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0_0-186-338_20241116120326527.parquet
1695046462179 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 rider-D driver-L
33.9 san_francisco
20241116120331896 20241116120331896_0_9
334e26e9-8355-45cc-97c6-c31daf0df330 city=san_francisco
1ba64ef0-bba2-469e-8ef5-696f8cdbe141-0 1695159649087
334e26e9-8355-45cc-97c6-c31daf0df330 rider-A driver-K 20.0
san_francisco
20241116120326527 20241116120326527_1_1
7a84095f-737f-40bc-b62f-6b69664712d2 city=sao_paulo
ba555452-0c3c-47dc-acc0-f90823e12408-0_1-186-339_20241116120326527.parquet
1695376420876 7a84095f-737f-40bc-b62f-6b69664712d2 rider-G driver-Q
43.4 sao_paulo
20241116120326527 20241116120326527_2_0
3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 city=chennai
8dacb2f9-6901-4ab3-8139-697b51125f16-0_2-186-340_20241116120326527.parquet
1695173887231 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 rider-I driver-S
41.06 chennai
20241116120326527 20241116120326527_2_1
c8abbe79-8d89-47ea-b4ce-4d224bae5bfa city=chennai
8dacb2f9-6901-4ab3-8139-697b51125f16-0_2-186-340_20241116120326527.parquet
1695115999911 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa rider-J driver-T
17.85 chennai
Time taken: 1.719 seconds, Fetched 7 row(s) {code}
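The `HoodiePositionBasedFileGroupRecordBuffer` warnings in the transcript above show the reader falling back from position-based to key-based merging when the log blocks carry no record positions. A minimal plain-Python sketch of that fallback logic (illustrative only, not Hudi's actual implementation; the data shapes are simplified):

```python
# Illustrative sketch: merge log-block updates into base-file rows either by
# row position (fast path) or by record key (fallback), mirroring the
# "Falling back to key based merge for Read" behavior seen in the logs.
def merge_log_into_base(base_rows, log_updates):
    """base_rows: list of dicts (with a 'uuid' key), ordered by file position.
    log_updates: list of (position_or_None, record_dict) from a log block."""
    merged = [dict(row) for row in base_rows]
    have_positions = all(pos is not None for pos, _ in log_updates)
    if have_positions:
        # Fast path: apply each update directly at its recorded row position.
        for pos, rec in log_updates:
            merged[pos].update(rec)
    else:
        # Fallback: match updates to base rows by record key (slower).
        by_key = {row["uuid"]: row for row in merged}
        for _, rec in log_updates:
            by_key[rec["uuid"]].update(rec)
    return merged
```

Both paths produce the same result; the position-based path just avoids building and probing the key map, which is why missing positions only cost read performance, not correctness.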
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-8553
- Type: Sub-task
- Parent: https://issues.apache.org/jira/browse/HUDI-9107
- Fix version(s): 1.1.0
---
## Comments
21/Nov/24 21:18 - jonvex: I have verified this with the following script:
{code:java}
SET hoodie.spark.sql.optimized.writes.enable = false;
CREATE TABLE table2 ( ts BIGINT, uuid STRING, rider STRING,
driver STRING, fare DOUBLE, city STRING ) USING HUDI LOCATION
'file:///tmp/testpositions' TBLPROPERTIES ( type = 'mor', primaryKey =
'uuid', preCombineField = 'ts' ) PARTITIONED BY (city);
INSERT INTO table2 VALUES
(1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
(1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'san_francisco'),
(1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90,'san_francisco'),
(1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
(1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'),
(1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40,'sao_paulo'),
(1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06,'chennai'),
(1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
SET hoodie.merge.small.file.group.candidates.limit = 0;
UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';
DELETE FROM table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
select * from table2; {code}
I tested with optimized writes enabled and disabled. When optimized writes
are disabled, there is no warning about position fallback.
Here is the run with optimized writes set to false:
{code:java}
spark-sql (default)> SET hoodie.spark.sql.optimized.writes.enable = false;
24/11/21 16:11:45 WARN DFSPropertiesConfiguration: Properties file
file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/11/21 16:11:45 WARN DFSPropertiesConfiguration: Cannot find
HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
hoodie.spark.sql.optimized.writes.enable false
Time taken: 0.764 seconds, Fetched 1 row(s)
spark-sql (default)> CREATE TABLE table2 (
> ts BIGINT,
> uuid STRING,
> rider STRING,
> driver STRING,
> fare DOUBLE,
> city STRING
> ) USING HUDI
> LOCATION 'file:///tmp/testpositions'
> TBLPROPERTIES (
> type = 'mor',
> primaryKey = 'uuid',
> preCombineField = 'ts'
> )
> PARTITIONED BY (city);
24/11/21 16:11:52 WARN TableSchemaResolver: Could not find any data file
written for commit, so could not get schema for table file:/tmp/testpositions
Time taken: 0.384 seconds
spark-sql (default)> INSERT INTO table2
> VALUES
> (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
> (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'san_francisco'),
> (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90,'san_francisco'),
> (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
> (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'),
> (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40,'sao_paulo'),
> (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06,'chennai'),
> (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/21 16:12:02 WARN TableSchemaResolver: Could not find any data file
written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:12:03 WARN TableSchemaResolver: Could not find any data file
written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:12:05 WARN MetricsConfig: Cannot locate configuration: tried
hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
24/11/21 16:12:05 WARN HoodieBackedTableMetadataWriter: Skipping secondary
index initialization as only one secondary index bootstrap at a time is
supported for now. Provided: []
# WARNING: Unable to attach Serviceability Agent. Unable to attach even with
module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException:
Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense
failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense
failed.]
24/11/21 16:12:08 WARN HoodieBackedTableMetadataWriter: Skipping secondary
index initialization as only one secondary index bootstrap at a time is
supported for now. Provided: []
Time taken: 5.728 seconds
spark-sql (default)> SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit 0
Time taken: 0.012 seconds, Fetched 1 row(s)
spark-sql (default)> UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';
24/11/21 16:12:16 WARN SparkStringUtils: Truncated the string representation
of a plan since it was too large. This behavior can be adjusted by setting
'spark.sql.debug.maxToStringFields'.
24/11/21 16:12:16 WARN HoodieFileIndex: Data skipping requires both Metadata
Table and at least one of Column Stats Index, Record Level Index, or Functional
Index to be enabled as well! (isMetadataTableEnabled = false,
isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false,
isFunctionalIndexEnabled = false, isBucketIndexEnable = false,
isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:12:16 WARN HoodieBackedTableMetadataWriter: Skipping secondary
index initialization as only one secondary index bootstrap at a time is
supported for now. Provided: []
24/11/21 16:12:17 WARN HoodieBackedTableMetadataWriter: Skipping secondary
index initialization as only one secondary index bootstrap at a time is
supported for now. Provided: []
Time taken: 1.802 seconds
spark-sql (default)> DELETE FROM table2 WHERE uuid =
'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/21 16:12:27 WARN HoodieFileIndex: Data skipping requires both Metadata
Table and at least one of Column Stats Index, Record Level Index, or Functional
Index to be enabled as well! (isMetadataTableEnabled = false,
isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false,
isFunctionalIndexEnabled = false, isBucketIndexEnable = false,
isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:12:27 WARN HoodieBackedTableMetadataWriter: Skipping secondary
index initialization as only one secondary index bootstrap at a time is
supported for now. Provided: []
24/11/21 16:12:27 WARN HoodieBackedTableMetadataWriter: Skipping secondary
index initialization as only one secondary index bootstrap at a time is
supported for now. Provided: []
Time taken: 1.332 seconds
spark-sql (default)> select * from table2;
20241121161203621 20241121161203621_0_0
1dced545-862b-4ceb-8b43-d2a568f6616b city=san_francisco
1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet
1695332066204 1dced545-862b-4ceb-8b43-d2a568f6616b rider-E driver-O
93.5 san_francisco
20241121161203621 20241121161203621_0_1
e96c4396-3fad-413a-a942-4cb36106d721 city=san_francisco
1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet
1695091554788 e96c4396-3fad-413a-a942-4cb36106d721 rider-C driver-M
27.7 san_francisco
20241121161203621 20241121161203621_0_2
9909a8b1-2d15-4d3d-8ec9-efc48c536a00 city=san_francisco
1ad629cc-6f75-4ac3-bff2-e4f842421f51-0_0-21-67_20241121161203621.parquet
1695046462179 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 rider-D driver-L
33.9 san_francisco
20241121161216516 20241121161216516_0_1
334e26e9-8355-45cc-97c6-c31daf0df330 city=san_francisco
1ad629cc-6f75-4ac3-bff2-e4f842421f51-0 1695159649087
334e26e9-8355-45cc-97c6-c31daf0df330 rider-A driver-K 20.0
san_francisco
20241121161203621 20241121161203621_1_1
7a84095f-737f-40bc-b62f-6b69664712d2 city=sao_paulo
c06df00f-d40d-42b1-b320-52de6bd05d0e-0_1-21-68_20241121161203621.parquet
1695376420876 7a84095f-737f-40bc-b62f-6b69664712d2 rider-G driver-Q
43.4 sao_paulo
20241121161203621 20241121161203621_2_0
3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 city=chennai
41db64e9-04c0-4fcb-8378-ce50e0dc7c22-0_2-21-69_20241121161203621.parquet
1695173887231 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 rider-I driver-S
41.06 chennai
20241121161203621 20241121161203621_2_1
c8abbe79-8d89-47ea-b4ce-4d224bae5bfa city=chennai
41db64e9-04c0-4fcb-8378-ce50e0dc7c22-0_2-21-69_20241121161203621.parquet
1695115999911 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa rider-J driver-T
17.85 chennai
Time taken: 0.219 seconds, Fetched 7 row(s) {code}
And here is the run without setting optimized writes to false (the setting
defaults to true):
{code:java}
spark-sql (default)> CREATE TABLE table2 (
> ts BIGINT,
> uuid STRING,
> rider STRING,
> driver STRING,
> fare DOUBLE,
> city STRING
> ) USING HUDI
> LOCATION 'file:///tmp/testpositions'
> TBLPROPERTIES (
> type = 'mor',
> primaryKey = 'uuid',
> preCombineField = 'ts'
> )
> PARTITIONED BY (city);
24/11/21 16:14:20 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
24/11/21 16:14:20 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/11/21 16:14:20 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
Time taken: 1.004 seconds
spark-sql (default)> INSERT INTO table2
> VALUES
> (1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
> (1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'san_francisco'),
> (1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90,'san_francisco'),
> (1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
> (1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'),
> (1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40,'sao_paulo'),
> (1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06,'chennai'),
> (1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');
24/11/21 16:14:28 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:14:28 WARN TableSchemaResolver: Could not find any data file written for commit, so could not get schema for table file:/tmp/testpositions
24/11/21 16:14:30 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
24/11/21 16:14:31 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
# WARNING: Unable to attach Serviceability Agent. Unable to attach even with module exceptions: [org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed., org.apache.hudi.org.openjdk.jol.vm.sa.SASupportException: Sense failed.]
24/11/21 16:14:33 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 5.734 seconds
spark-sql (default)> SET hoodie.merge.small.file.group.candidates.limit = 0;
hoodie.merge.small.file.group.candidates.limit 0
Time taken: 0.016 seconds, Fetched 1 row(s)
spark-sql (default)> UPDATE table2 SET fare = 20.0 WHERE rider = 'rider-A';
24/11/21 16:14:41 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/11/21 16:14:41 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:14:41 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:14:42 WARN HoodieDataBlock: There are records without valid positions. Skip writing record positions to the data block header.
24/11/21 16:14:42 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.59 seconds
spark-sql (default)> DELETE FROM table2 WHERE uuid = 'e3cf430c-889d-4015-bc98-59bdce1e530c';
24/11/21 16:14:47 WARN HoodieFileIndex: Data skipping requires both Metadata Table and at least one of Column Stats Index, Record Level Index, or Functional Index to be enabled as well! (isMetadataTableEnabled = false, isColumnStatsIndexEnabled = false, isRecordIndexApplicable = false, isFunctionalIndexEnabled = false, isBucketIndexEnable = false, isPartitionStatsIndexEnabled = false), isBloomFiltersIndexEnabled = false)
24/11/21 16:14:47 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
24/11/21 16:14:47 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:47 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/21 16:14:47 WARN HoodieDeleteBlock: There are delete records without valid positions. Skip writing record positions to the delete block header.
24/11/21 16:14:47 WARN HoodieBackedTableMetadataWriter: Skipping secondary index initialization as only one secondary index bootstrap at a time is supported for now. Provided: []
Time taken: 1.103 seconds
spark-sql (default)> select * from table2;
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: No record position info is found when attempt to do position based merge.
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
24/11/21 16:14:53 WARN HoodiePositionBasedFileGroupRecordBuffer: Falling back to key based merge for Read
20241121161428912 20241121161428912_0_0 1dced545-862b-4ceb-8b43-d2a568f6616b city=san_francisco cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet 1695332066204 1dced545-862b-4ceb-8b43-d2a568f6616b rider-E driver-O 93.5 san_francisco
20241121161428912 20241121161428912_0_1 e96c4396-3fad-413a-a942-4cb36106d721 city=san_francisco cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet 1695091554788 e96c4396-3fad-413a-a942-4cb36106d721 rider-C driver-M 27.7 san_francisco
20241121161428912 20241121161428912_0_2 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 city=san_francisco cf8f187a-f827-454d-a26f-114e30c519ed-0_0-21-67_20241121161428912.parquet 1695046462179 9909a8b1-2d15-4d3d-8ec9-efc48c536a00 rider-D driver-L 33.9 san_francisco
20241121161441739 20241121161441739_0_1 334e26e9-8355-45cc-97c6-c31daf0df330 city=san_francisco cf8f187a-f827-454d-a26f-114e30c519ed-0 1695159649087 334e26e9-8355-45cc-97c6-c31daf0df330 rider-A driver-K 20.0 san_francisco
20241121161428912 20241121161428912_1_1 7a84095f-737f-40bc-b62f-6b69664712d2 city=sao_paulo 22b6070f-6c72-4a3d-9fc6-8bac16a7e873-0_1-21-68_20241121161428912.parquet 1695376420876 7a84095f-737f-40bc-b62f-6b69664712d2 rider-G driver-Q 43.4 sao_paulo
20241121161428912 20241121161428912_2_0 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 city=chennai 878ae75b-bb04-4ed8-8591-8fafc56ed7ba-0_2-21-69_20241121161428912.parquet 1695173887231 3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04 rider-I driver-S 41.06 chennai
20241121161428912 20241121161428912_2_1 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa city=chennai 878ae75b-bb04-4ed8-8591-8fafc56ed7ba-0_2-21-69_20241121161428912.parquet 1695115999911 c8abbe79-8d89-47ea-b4ce-4d224bae5bfa rider-J driver-T 17.85 chennai
Time taken: 0.185 seconds, Fetched 7 row(s) {code}
---
21/Nov/24 22:07 - jonvex: To unblock the release we can do one of two things:
# Disable `hoodie.spark.sql.optimized.writes.enable`
## This will decrease the performance of writes during keygen and index lookup
## Positions will be included in the updates
## We can check whether positions are even enabled and only default it when position writing is disabled
# Keep the code as is
## This will decrease performance when reading uncompacted file groups
## We have the ability to fall back when positions are missing, and I have written extensive test cases to ensure that fallback works correctly in all combinations of log and base files
## This can also use some extra disk space during the read, because we have to rewrite mappings in the spillable map, and deletes to the spillable map don't actually free up space until we close it
To actually write positions using the prepped workflow, I think there is a way to do it, but it will not be that easy:
# We will need to read _tmp_metadata_row_index inside the UPDATE and DELETE SQL commands. Then, during keygen, we will get the position from that field
## This will take some work, because I don't think we have ever tried to read positions at the dataset level
# Then we will need to get the positions out during key generation
## This should be easy
# Then we will need to drop the column before we do the write
## This will probably be pretty easy
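The steps above could be sketched roughly as follows in plain Python (hypothetical; the row shape and helper are illustrative, not actual Hudi or Spark APIs, though `_tmp_metadata_row_index` is the column named in the comment):

```python
# Hypothetical sketch of the proposed prepped flow: rows are read with the
# row-index metadata column attached, the position is extracted during key
# generation, and the column is dropped before the write.
ROW_INDEX_COL = "_tmp_metadata_row_index"

def keygen_with_positions(rows, key_field="uuid"):
    prepped = []
    for row in rows:
        row = dict(row)  # avoid mutating the caller's rows
        # Step 2: pull the position out during key generation;
        # pop() also performs step 3, dropping the column before the write.
        position = row.pop(ROW_INDEX_COL)
        prepped.append((row[key_field], position, row))
    return prepped
```

The output pairs each record key with its base-file position, which is exactly what the log block header needs to record valid positions.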
---
26/Nov/24 00:30 - yihua: Deferring this to Hudi 1.1, since this does not cause
a correctness issue, and adding positional updates and deletes in SQL UPDATE and
DELETE needs design.
---
10/Jan/25 00:18 - yihua: In the UPDATE and DELETE commands, we'll try creating
the relation with a schema that includes the row index meta column, or a new hoodie
meta column, to attach the row index column to the returned DataFrame (this also
requires fixing the wiring so that the file group reader and Parquet reader keep
the new row index column). That way, we can pass the positions down to the prepped
write flow and prepare the HoodieRecords with the current record location.
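As a rough model of that idea (plain Python, not Hudi's API; the field names and helpers are illustrative), attaching a row index on read lets the prepped write flow build records that already carry their current location:

```python
# Toy model: attach a row-index column on read, then use it to tag each
# record with its current location for the prepped write flow.
def attach_row_index(rows):
    # Simulates the reader exposing a row-index meta column per base-file row.
    return [dict(row, _row_index=i) for i, row in enumerate(rows)]

def to_prepped_records(rows, file_id, partition):
    # Each record carries its current location, so the writer can emit
    # valid positions into the log block headers instead of skipping them.
    return [
        {
            "key": row["uuid"],
            "location": {"partition": partition, "file_id": file_id,
                         "position": row["_row_index"]},
        }
        for row in rows
    ]
```

With locations known up front, the writer would no longer hit the "records without valid positions" path shown in the transcripts above.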
---
10/Jan/25 01:59 - yihua: I have a draft PR up which makes the prepped upsert
flow write record positions to the log blocks from the Spark SQL UPDATE statement.
I'm going to fix a few issues before opening it up for review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]