[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimal

2021-12-15 Thread GitBox


xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r769596391



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##
@@ -723,4 +723,29 @@ class TestCOWDataSource extends HoodieClientTestBase {
 val result = spark.sql("select * from tmptable limit 1").collect()(0)
 result.schema.contains(new StructField("partition", StringType, true))
   }
+
+  @Test
+  def testWriteSmallPrecisionDecimalTable(): Unit = {

Review comment:
   It is difficult to read the value of 
hoodie.parquet.writeLegacyFormat.enabled directly from Spark, so I added 
function tests for autoModifyParquetWriteLegacyFormatParameter to cover all 
scenarios.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimal

2021-12-15 Thread GitBox


xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r769594604



##
File path: 
hudi-spark-datasource/hudi-spark2/src/main/java/org/apache/hudi/internal/DefaultSource.java
##
@@ -62,10 +68,19 @@ public DataSourceReader createReader(DataSourceOptions 
options) {
 String instantTime = 
options.get(DataSourceInternalWriterHelper.INSTANT_TIME_OPT_KEY).get();
 String path = options.get("path").get();
 String tblName = options.get(HoodieWriteConfig.TBL_NAME.key()).get();
+Map parameters = options.asMap();
 boolean populateMetaFields = 
options.getBoolean(HoodieTableConfig.POPULATE_META_FIELDS.key(),
 
Boolean.parseBoolean(HoodieTableConfig.POPULATE_META_FIELDS.defaultValue()));
+// Now by default ParquetWriteSupport will write DecimalType to parquet as 
int32/int64 when the scale of decimalType < Decimal.MAX_LONG_DIGITS(),

Review comment:
   fixed








[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimal

2021-12-12 Thread GitBox


xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r767376871



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
##
@@ -723,4 +723,26 @@ class TestCOWDataSource extends HoodieClientTestBase {
 val result = spark.sql("select * from tmptable limit 1").collect()(0)
 result.schema.contains(new StructField("partition", StringType, true))
   }
+
+  @Test
+  def testWriteSmallPrecisionDecimalTable(): Unit = {
+val records1 = recordsToStrings(dataGen.generateInserts("001", 5)).toList
+val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+  .withColumn("shortDecimal", lit(new java.math.BigDecimal(s"2090."))) 
// create decimalType(8, 4)
+inputDF1.write.format("org.apache.hudi")
+  .options(commonOpts)
+  .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
+  .mode(SaveMode.Overwrite)
+  .save(basePath)
+
+val records2 = recordsToStrings(dataGen.generateUpdates("002", 5)).toList
+val inputDF2 = spark.read.json(spark.sparkContext.parallelize(records2, 2))
+  .withColumn("shortDecimal", lit(new java.math.BigDecimal(s"2090."))) 
// create decimalType(8, 4)
+inputDF2.write.format("org.apache.hudi")
+  .options(commonOpts)
+  .option(DataSourceWriteOptions.OPERATION.key, 
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
+  .mode(SaveMode.Append)
+  .save(basePath)
+assert(spark.read.format("hudi").load(basePath).count() == 5)

Review comment:
   yes, fixed








[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimal

2021-12-12 Thread GitBox


xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r767376414



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java
##
@@ -46,13 +53,32 @@
   public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
 super();
 Configuration hadoopConf = new Configuration(conf);
-hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
writeConfig.parquetWriteLegacyFormatEnabled());
+hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
findSmallPrecisionDecimalType(structType) ? "true" : 
writeConfig.parquetWriteLegacyFormatEnabled());

Review comment:
   good suggestion, fixed

##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java
##
@@ -46,13 +53,32 @@
   public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
 super();
 Configuration hadoopConf = new Configuration(conf);
-hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
writeConfig.parquetWriteLegacyFormatEnabled());
+hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
findSmallPrecisionDecimalType(structType) ? "true" : 
writeConfig.parquetWriteLegacyFormatEnabled());
 hadoopConf.set("spark.sql.parquet.outputTimestampType", 
writeConfig.parquetOutputTimestampType());
 this.hadoopConf = hadoopConf;
 setSchema(structType, hadoopConf);
 this.bloomFilter = bloomFilter;
   }
 
+  // Now by default ParquetWriteSupport will write DecimalType to parquet as 
int32/int64 when the scale of decimalType < Decimal.MAX_LONG_DIGITS(),
+  // but AvroParquetReader which used by HoodieParquetReader cannot support 
read int32/int64 as DecimalType.
+  // try to find current sparkType whether contains that DecimalType.
+  private boolean findSmallPrecisionDecimalType(DataType sparkType) {

Review comment:
   fixed








[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimal

2021-12-08 Thread GitBox


xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r765414682



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java
##
@@ -46,13 +53,32 @@
   public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
 super();
 Configuration hadoopConf = new Configuration(conf);
-hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
writeConfig.parquetWriteLegacyFormatEnabled());
+hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
findSmallPrecisionDecimalType(structType) ? "true" : 
writeConfig.parquetWriteLegacyFormatEnabled());

Review comment:
   @codope thanks for your review.
   Yes. I think that if findSmallPrecisionDecimalType returns false, we need 
to respect the user's setting; if findSmallPrecisionDecimalType returns true, 
we need to ignore the user's choice, because the user's choice may cause 
subsequent updates of the Hudi table to fail.
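
   The rule described above can be sketched roughly as follows. This is a 
self-contained illustration, not the code in the PR: the DataType, 
DecimalType, ArrayType and StructType classes are hypothetical stand-ins for 
Spark's types, resolveWriteLegacyFormat is a made-up helper name, and the 
threshold of 18 with a check on precision is an assumption modeled on the 
Decimal.MAX_LONG_DIGITS() comment in the diff.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: stand-in types, not Spark's org.apache.spark.sql.types classes.
final class LegacyFormatDecisionSketch {

    // Assumed threshold, mirroring the Decimal.MAX_LONG_DIGITS() mentioned in the diff.
    static final int MAX_LONG_DIGITS = 18;

    interface DataType {}

    static final class StringType implements DataType {}

    static final class DecimalType implements DataType {
        final int precision;
        final int scale;
        DecimalType(int precision, int scale) {
            this.precision = precision;
            this.scale = scale;
        }
    }

    static final class ArrayType implements DataType {
        final DataType elementType;
        ArrayType(DataType elementType) {
            this.elementType = elementType;
        }
    }

    static final class StructType implements DataType {
        final List<DataType> fieldTypes;
        StructType(List<DataType> fieldTypes) {
            this.fieldTypes = fieldTypes;
        }
    }

    // Walks the type tree and reports whether any decimal is below the assumed threshold.
    static boolean findSmallPrecisionDecimalType(DataType type) {
        if (type instanceof DecimalType) {
            return ((DecimalType) type).precision < MAX_LONG_DIGITS;
        }
        if (type instanceof ArrayType) {
            return findSmallPrecisionDecimalType(((ArrayType) type).elementType);
        }
        if (type instanceof StructType) {
            for (DataType field : ((StructType) type).fieldTypes) {
                if (findSmallPrecisionDecimalType(field)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Hypothetical helper: force "true" when a small decimal exists,
    // otherwise keep whatever the user configured.
    static String resolveWriteLegacyFormat(DataType schema, String userSetting) {
        return findSmallPrecisionDecimalType(schema) ? "true" : userSetting;
    }

    public static void main(String[] args) {
        DataType withSmallDecimal = new StructType(Arrays.asList(
            new StringType(), new DecimalType(8, 4)));
        DataType withoutDecimal = new StructType(Arrays.asList(
            new StringType(), new ArrayType(new StringType())));

        System.out.println(resolveWriteLegacyFormat(withSmallDecimal, "false"));
        System.out.println(resolveWriteLegacyFormat(withoutDecimal, "false"));
    }
}
```

   With this shape, the user's value only takes effect for schemas that have 
no small-precision decimal, which matches the behaviour argued for above.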



