[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimal
xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r769596391

## File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala ##
@@ -723,4 +723,29 @@ class TestCOWDataSource extends HoodieClientTestBase {
     val result = spark.sql("select * from tmptable limit 1").collect()(0)
     result.schema.contains(new StructField("partition", StringType, true))
   }
+
+  @Test
+  def testWriteSmallPrecisionDecimalTable(): Unit = {

Review comment:
It is difficult to read the value of hoodie.parquet.writeLegacyFormat.enabled directly from Spark. Added function-level tests for autoModifyParquetWriteLegacyFormatParameter to cover all scenarios.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r769594604

## File path: hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/DefaultSource.java ##
@@ -62,10 +68,19 @@ public DataSourceReader createReader(DataSourceOptions options) {
     String instantTime = options.get(DataSourceInternalWriterHelper.INSTANT_TIME_OPT_KEY).get();
     String path = options.get("path").get();
     String tblName = options.get(HoodieWriteConfig.TBL_NAME.key()).get();
+    Map parameters = options.asMap();
     boolean populateMetaFields = options.getBoolean(HoodieTableConfig.POPULATE_META_FIELDS.key(),
         Boolean.parseBoolean(HoodieTableConfig.POPULATE_META_FIELDS.defaultValue()));
+    // Now by default ParquetWriteSupport will write DecimalType to parquet as int32/int64 when the scale of decimalType < Decimal.MAX_LONG_DIGITS(),

Review comment:
fixed
xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r767376871

## File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala ##
@@ -723,4 +723,26 @@ class TestCOWDataSource extends HoodieClientTestBase {
     val result = spark.sql("select * from tmptable limit 1").collect()(0)
     result.schema.contains(new StructField("partition", StringType, true))
   }
+
+  @Test
+  def testWriteSmallPrecisionDecimalTable(): Unit = {
+    val records1 = recordsToStrings(dataGen.generateInserts("001", 5)).toList
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+      .withColumn("shortDecimal", lit(new java.math.BigDecimal(s"2090."))) // create decimalType(8, 4)
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    val records2 = recordsToStrings(dataGen.generateUpdates("002", 5)).toList
+    val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+      .withColumn("shortDecimal", lit(new java.math.BigDecimal(s"2090."))) // create decimalType(8, 4)
+    inputDF2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Append)
+      .save(basePath)
+    assert(spark.read.format("hudi").load(basePath).count() == 5)

Review comment:
yes, fixed
xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r767376414

## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java ##
@@ -46,13 +53,32 @@ public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
     super();
     Configuration hadoopConf = new Configuration(conf);
-    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", writeConfig.parquetWriteLegacyFormatEnabled());
+    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", findSmallPrecisionDecimalType(structType) ? "true" : writeConfig.parquetWriteLegacyFormatEnabled());

Review comment:
good suggestion, fixed

## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java ##
@@ -46,13 +53,32 @@ public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
     super();
     Configuration hadoopConf = new Configuration(conf);
-    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", writeConfig.parquetWriteLegacyFormatEnabled());
+    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", findSmallPrecisionDecimalType(structType) ? "true" : writeConfig.parquetWriteLegacyFormatEnabled());
     hadoopConf.set("spark.sql.parquet.outputTimestampType", writeConfig.parquetOutputTimestampType());
     this.hadoopConf = hadoopConf;
     setSchema(structType, hadoopConf);
     this.bloomFilter = bloomFilter;
   }
+  // Now by default ParquetWriteSupport will write DecimalType to parquet as int32/int64 when the scale of decimalType < Decimal.MAX_LONG_DIGITS(),
+  // but AvroParquetReader which used by HoodieParquetReader cannot support read int32/int64 as DecimalType.
+  // try to find current sparkType whether contains that DecimalType.
+  private boolean findSmallPrecisionDecimalType(DataType sparkType) {

Review comment:
fixed
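The code comments in the hunk above describe searching the Spark schema for a decimal type that would be written as int32/int64. A minimal sketch of that kind of recursive search follows. It does not use Spark and is not the PR's implementation: `Type`, `Primitive`, `DecimalT`, `ArrayT`, `MapT`, `StructT`, and `containsSmallDecimal` are hypothetical stand-ins, and the threshold of 18 is an assumption meant to mirror `Decimal.MAX_LONG_DIGITS()`. The comparison is applied to the precision, as the method name suggests, and keeps the `<` from the quoted comment.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: hypothetical stand-ins for Spark's DataType hierarchy, not the real classes.
final class SmallDecimalSketch {
    interface Type {}

    static final class Primitive implements Type {}

    static final class DecimalT implements Type {
        final int precision;
        final int scale;
        DecimalT(int precision, int scale) { this.precision = precision; this.scale = scale; }
    }

    static final class ArrayT implements Type {
        final Type element;
        ArrayT(Type element) { this.element = element; }
    }

    static final class MapT implements Type {
        final Type key;
        final Type value;
        MapT(Type key, Type value) { this.key = key; this.value = value; }
    }

    static final class StructT implements Type {
        final List<Type> fields;
        StructT(Type... fields) { this.fields = Arrays.asList(fields); }
    }

    // Assumption: mirrors Spark's Decimal.MAX_LONG_DIGITS(), taken to be 18 here.
    static final int MAX_LONG_DIGITS = 18;

    // Returns true if the type, or anything nested inside it, is a decimal
    // whose precision is below the threshold.
    static boolean containsSmallDecimal(Type type) {
        if (type instanceof DecimalT) {
            return ((DecimalT) type).precision < MAX_LONG_DIGITS;
        }
        if (type instanceof ArrayT) {
            return containsSmallDecimal(((ArrayT) type).element);
        }
        if (type instanceof MapT) {
            MapT map = (MapT) type;
            return containsSmallDecimal(map.key) || containsSmallDecimal(map.value);
        }
        if (type instanceof StructT) {
            for (Type field : ((StructT) type).fields) {
                if (containsSmallDecimal(field)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A struct holding an array of decimal(8, 4), like the test column in this PR.
        Type schema = new StructT(new Primitive(), new ArrayT(new DecimalT(8, 4)));
        System.out.println(containsSmallDecimal(schema));
        System.out.println(containsSmallDecimal(new DecimalT(20, 2)));
    }
}
```

The recursion matters because a small-precision decimal nested inside an array, map, or struct is written the same way as a top-level one, so checking only top-level fields would miss it.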
xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r765414682

## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java ##
@@ -46,13 +53,32 @@ public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
     super();
     Configuration hadoopConf = new Configuration(conf);
-    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", writeConfig.parquetWriteLegacyFormatEnabled());
+    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", findSmallPrecisionDecimalType(structType) ? "true" : writeConfig.parquetWriteLegacyFormatEnabled());

Review comment:
@codope Thanks for your review. Yes, I think that if findSmallPrecisionDecimalType returns false, we should respect the user's setting; if it returns true, we need to ignore the user's choice, because that choice may cause subsequent updates to the Hudi table to fail.
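The rule described in this comment reduces to a single conditional. A minimal sketch, assuming the schema check has already been run: `resolveWriteLegacyFormat` and its parameter names are hypothetical, and the setting is modelled as a string because the quoted diff passes it to `hadoopConf.set` in that form.

```java
// Sketch only: hypothetical helper illustrating the override rule from the comment above.
final class LegacyFormatRuleSketch {
    // hasSmallPrecisionDecimal: result of a schema check such as findSmallPrecisionDecimalType.
    // userSetting: the user's configured value, "true" or "false".
    static String resolveWriteLegacyFormat(boolean hasSmallPrecisionDecimal, String userSetting) {
        // A small-precision decimal forces the legacy format; otherwise the user's value wins.
        return hasSmallPrecisionDecimal ? "true" : userSetting;
    }

    public static void main(String[] args) {
        System.out.println(resolveWriteLegacyFormat(true, "false"));
        System.out.println(resolveWriteLegacyFormat(false, "false"));
        System.out.println(resolveWriteLegacyFormat(false, "true"));
    }
}
```

The override applies in one direction only: a user who enables the legacy format keeps it regardless of the schema, while a user who disables it is overruled only when the schema would otherwise produce files that later updates cannot read.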