[GitHub] [hudi] nsivabalan commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimalType

GitBox Sat, 11 Dec 2021 22:09:21 -0800


nsivabalan commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r767174406




##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java
##########
@@ -46,13 +53,32 @@
   public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
     super();
     Configuration hadoopConf = new Configuration(conf);
-    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
writeConfig.parquetWriteLegacyFormatEnabled());
+    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
findSmallPrecisionDecimalType(structType) ? "true" : 
writeConfig.parquetWriteLegacyFormatEnabled());

Review comment:
       Can we move the determination of right value for 
"spark.sql.parquet.writeLegacyFormat" to some higher layer. just once per 
driver. and also, lets add a warn log just incase we are overriding user config 
(parquetWriteLegacyFormatEnabled). 

##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java
##########
@@ -46,13 +53,32 @@
   public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
     super();
     Configuration hadoopConf = new Configuration(conf);
-    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
writeConfig.parquetWriteLegacyFormatEnabled());
+    hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
findSmallPrecisionDecimalType(structType) ? "true" : 
writeConfig.parquetWriteLegacyFormatEnabled());
     hadoopConf.set("spark.sql.parquet.outputTimestampType", 
writeConfig.parquetOutputTimestampType());
     this.hadoopConf = hadoopConf;
     setSchema(structType, hadoopConf);
     this.bloomFilter = bloomFilter;
   }
 
+  // Now by default ParquetWriteSupport will write DecimalType to parquet as 
int32/int64 when the scale of decimalType < Decimal.MAX_LONG_DIGITS(),
+  // but AvroParquetReader which used by HoodieParquetReader cannot support 
read int32/int64 as DecimalType.
+  // try to find current sparkType whether contains that DecimalType.
+  private boolean findSmallPrecisionDecimalType(DataType sparkType) {

Review comment:
       nit: foundDecimalTypeOfSmallPrecision

##########
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##########
@@ -723,4 +723,26 @@ class TestCOWDataSource extends HoodieClientTestBase {
     val result = spark.sql("select * from tmptable limit 1").collect()(0)
     result.schema.contains(new StructField("partition", StringType, true))
   }
+
+  @Test
+  def testWriteSmallPrecisionDecimalTable(): Unit = {
+    val records1 = recordsToStrings(dataGen.generateInserts("001", 5)).toList
+    val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+      .withColumn("shortDecimal", lit(new java.math.BigDecimal(s"2090.0000"))) 
// create decimalType(8, 4)
+    inputDF1.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Overwrite)
+      .save(basePath)
+
+    val records2 = recordsToStrings(dataGen.generateUpdates("002", 5)).toList
+    val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+      .withColumn("shortDecimal", lit(new java.math.BigDecimal(s"2090.0000"))) 
// create decimalType(8, 4)
+    inputDF2.write.format("org.apache.hudi")
+      .options(commonOpts)
+      .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
+      .mode(SaveMode.Append)
+      .save(basePath)
+    assert(spark.read.format("hudi").load(basePath).count() == 5)

Review comment:
       can we add an assertion to check equality of dataset for inputDf2 and 
snapshot read from hudi. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] nsivabalan commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimalType

Reply via email to