[GitHub] [hudi] boneanxs commented on a diff in pull request #8076: [HUDI-5884] Support bulk_insert for insert_overwrite and insert_overwrite_table

2023-04-26 Thread via GitHub


boneanxs commented on code in PR #8076:
URL: https://github.com/apache/hudi/pull/8076#discussion_r1177516056


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##
@@ -770,66 +770,70 @@ object HoodieSparkSqlWriter {
 }
   }
 
-  def bulkInsertAsRow(sqlContext: SQLContext,
+  def bulkInsertAsRow(writeClient: SparkRDDWriteClient[_],
+  parameters: Map[String, String],
   hoodieConfig: HoodieConfig,
   df: DataFrame,
+  mode: SaveMode,
   tblName: String,
   basePath: Path,
-  path: String,
   instantTime: String,
   writerSchema: Schema,
-  isTablePartitioned: Boolean): (Boolean, common.util.Option[String]) = {
+  tableConfig: HoodieTableConfig):
+  (Boolean, HOption[String], HOption[String], HOption[String], SparkRDDWriteClient[_], HoodieTableConfig) = {
 if (hoodieConfig.getBoolean(INSERT_DROP_DUPS)) {
   throw new HoodieException("Dropping duplicates with bulk_insert in row writer path is not supported yet")
 }
+val sqlContext = writeClient.getEngineContext.asInstanceOf[HoodieSparkEngineContext].getSqlContext
+val jsc = writeClient.getEngineContext.asInstanceOf[HoodieSparkEngineContext].getJavaSparkContext
 
 val writerSchemaStr = writerSchema.toString
 
-val opts = hoodieConfig.getProps.toMap ++
+// Make opts mutable since it could be modified by tryOverrideParquetWriteLegacyFormatProperty
+val opts = mutable.Map() ++ hoodieConfig.getProps.toMap ++
   Map(HoodieWriteConfig.AVRO_SCHEMA_STRING.key -> writerSchemaStr)
 
-val writeConfig = DataSourceUtils.createHoodieConfig(writerSchemaStr, path, tblName, mapAsJavaMap(opts))
-val populateMetaFields = hoodieConfig.getBoolean(HoodieTableConfig.POPULATE_META_FIELDS)
-
-val bulkInsertPartitionerRows: BulkInsertPartitioner[Dataset[Row]] = if (populateMetaFields) {
-  val userDefinedBulkInsertPartitionerOpt = DataSourceUtils.createUserDefinedBulkInsertPartitionerWithRows(writeConfig)
-  if (userDefinedBulkInsertPartitionerOpt.isPresent) {
-userDefinedBulkInsertPartitionerOpt.get
-  } else {
-BulkInsertInternalPartitionerWithRowsFactory.get(writeConfig, isTablePartitioned)
-  }
-} else {
-  // Sort modes are not yet supported when meta fields are disabled
-  new NonSortPartitionerWithRows()
+// Auto set the value of "hoodie.parquet.writelegacyformat.enabled"
+tryOverrideParquetWriteLegacyFormatProperty(opts, convertAvroSchemaToStructType(writerSchema))
+val writeConfig = DataSourceUtils.createHoodieConfig(writerSchemaStr, basePath.toString, tblName, opts)
+val executor = mode match {
+  case SaveMode.Append =>
+new DatasetBulkInsertActionExecutor(writeConfig, writeClient, instantTime)

Review Comment:
   `writeClient` is specifically for `RDD[HoodieRecord]`; since all the `xxxActionExecutor`s here are `Dataset[Row]`-based, I didn't put this logic there before.
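The `mode match` block in the quoted diff dispatches each `SaveMode` to a dedicated `Dataset[Row]`-based executor instead of routing through the `RDD[HoodieRecord]` write-client path. A minimal Java sketch of that dispatch pattern follows; only the `Append` branch is visible in the quoted diff, so the two overwrite executor names below are assumptions inferred from the PR title, and the enum and interface are simplified stand-ins:

```java
public class ExecutorDispatch {
  enum SaveMode { APPEND, OVERWRITE, OVERWRITE_TABLE }

  // Simplified stand-in for the Dataset[Row]-based action executors in the PR.
  interface BulkInsertActionExecutor {
    String description();
  }

  static BulkInsertActionExecutor forMode(SaveMode mode) {
    // Mirrors the Scala `mode match { ... }` in bulkInsertAsRow: each save
    // mode maps to its own executor. The overwrite executor names are
    // hypothetical here, inferred from the PR title, not from the diff.
    switch (mode) {
      case APPEND:
        return () -> "DatasetBulkInsertActionExecutor";
      case OVERWRITE:
        return () -> "DatasetBulkInsertOverwriteActionExecutor";
      case OVERWRITE_TABLE:
        return () -> "DatasetBulkInsertOverwriteTableActionExecutor";
      default:
        throw new IllegalArgumentException("Unsupported save mode: " + mode);
    }
  }

  public static void main(String[] args) {
    System.out.println(forMode(SaveMode.APPEND).description());
  }
}
```

The design point under discussion is that this dispatch lives outside the write client, because the executors operate on `Dataset[Row]` while the client is typed for `RDD[HoodieRecord]`.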



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #8505: [HUDI-6106] Spark offline compaction/Clustering Job will do clean like Flink job

2023-04-26 Thread via GitHub


hudi-bot commented on PR #8505:
URL: https://github.com/apache/hudi/pull/8505#issuecomment-1522992882

   
   ## CI report:
   
   * f7c73e83812258b53b979afbd6d465e9066b801f UNKNOWN
   * 269fad02a5346121e823a15c9804e2e63eb16c30 UNKNOWN
   * 442430f680316bdfefc27c4aca9f7cd94e95373c UNKNOWN
   *  Unknown: [CANCELED](TBD) 
   * 4dad96ba54827548c95059d12b7d5d5cdcc0c1a4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16673)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] SteNicholas commented on a diff in pull request #8503: [HUDI-6047] Clustering operation on consistent hashing index resulting in duplicate data

2023-04-26 Thread via GitHub


SteNicholas commented on code in PR #8503:
URL: https://github.com/apache/hudi/pull/8503#discussion_r1177521748


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieTimelineArchiver.java:
##
@@ -509,7 +509,15 @@ private Stream getCommitInstantsToArchive() throws IOException {
   }
 
   private Stream getInstantsToArchive() throws IOException {
-Stream instants = Stream.concat(getCleanInstantsToArchive(), getCommitInstantsToArchive());
+List commitInstantsToArchive = getCommitInstantsToArchive().collect(Collectors.toList());
+Stream instants = Stream.concat(getCleanInstantsToArchive(), commitInstantsToArchive.stream());
+HoodieInstant hoodieOldestInstantToArchive = commitInstantsToArchive.stream().max(Comparator.comparing(maxInstant -> maxInstant.getTimestamp())).orElse(null);
+/**
+ * if hoodieOldestInstantToArchive is null that means nothing is getting archived, so no need to update metadata
+ */
+if (hoodieOldestInstantToArchive != null) {
+  table.getIndex().updateMetadata(table, Option.of(hoodieOldestInstantToArchive));

Review Comment:
   @rohan-uptycs, `getInstantsToArchive` is also used simply to fetch the instants to archive, so by design the update behavior shouldn't be invoked in this method.
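A side note on why the patch collects `getCommitInstantsToArchive()` into a `List` first: a Java `Stream` can be traversed only once, and the diff needs the commit instants twice — once for `Stream.concat` and once for the `max` scan. A self-contained sketch of that pattern (the instant strings are illustrative):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamReuse {
  public static void main(String[] args) {
    Stream<String> instants = Stream.of("001", "003", "002");

    // A Stream is single-use: materialize it when it is needed twice,
    // as the patch does with commitInstantsToArchive.
    List<String> collected = instants.collect(Collectors.toList());

    // First use: feed it into a concatenation (as in Stream.concat(...)).
    long total = Stream.concat(Stream.of("000"), collected.stream()).count();

    // Second use: find the newest instant (as in the max(...) scan).
    String newest = collected.stream()
        .max(Comparator.naturalOrder())
        .orElse(null);

    System.out.println(total + " " + newest); // prints "4 003"
  }
}
```

Calling a second terminal operation on the original `Stream` instead would throw `IllegalStateException: stream has already been operated upon or closed`, which is what the `collect(Collectors.toList())` in the diff avoids.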






[GitHub] [hudi] hudi-bot commented on pull request #7627: [HUDI-5517] HoodieTimeline support filter instants by state transition time

2023-04-26 Thread via GitHub


hudi-bot commented on PR #7627:
URL: https://github.com/apache/hudi/pull/7627#issuecomment-1522990780

   
   ## CI report:
   
   * 85b25f5cda4ccd8189a1607259e1732a910c3262 UNKNOWN
   * bfb9fbbed9a2423ba1781962cea8ccc277a84880 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16642)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16672)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16660)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #7627: [HUDI-5517] HoodieTimeline support filter instants by state transition time

2023-04-26 Thread via GitHub


hudi-bot commented on PR #7627:
URL: https://github.com/apache/hudi/pull/7627#issuecomment-1522936046

   
   ## CI report:
   
   * 85b25f5cda4ccd8189a1607259e1732a910c3262 UNKNOWN
   * bfb9fbbed9a2423ba1781962cea8ccc277a84880 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16642)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16660)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16672)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8579: [MINOR] Added docs on gotchas when using PartialUpdateAvroPayload

2023-04-26 Thread via GitHub


hudi-bot commented on PR #8579:
URL: https://github.com/apache/hudi/pull/8579#issuecomment-1522928842

   
   ## CI report:
   
   * fa50b514ec994cde256ae1f85778648dc94e5ef6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16671)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8505: [HUDI-6106] Spark offline compaction/Clustering Job will do clean like Flink job

2023-04-26 Thread via GitHub


hudi-bot commented on PR #8505:
URL: https://github.com/apache/hudi/pull/8505#issuecomment-1522928401

   
   ## CI report:
   
   * f7c73e83812258b53b979afbd6d465e9066b801f UNKNOWN
   * 269fad02a5346121e823a15c9804e2e63eb16c30 UNKNOWN
   * 442430f680316bdfefc27c4aca9f7cd94e95373c UNKNOWN
   * 4dad96ba54827548c95059d12b7d5d5cdcc0c1a4 UNKNOWN
   *  Unknown: [CANCELED](TBD) 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] boneanxs commented on pull request #7627: [HUDI-5517] HoodieTimeline support filter instants by state transition time

2023-04-26 Thread via GitHub


boneanxs commented on PR #7627:
URL: https://github.com/apache/hudi/pull/7627#issuecomment-1522926916

   @hudi-bot run azure





[GitHub] [hudi] alexone95 commented on issue #8535: [SUPPORT] manually deleting file under .hoodie/archived

2023-04-26 Thread via GitHub


alexone95 commented on issue #8535:
URL: https://github.com/apache/hudi/issues/8535#issuecomment-1522925284

   @nsivabalan the only problem is that we are using the Hudi 12.0.1 version, where the issue is not fixed, so we had to use this workaround to avoid it





[GitHub] [hudi] hudi-bot commented on pull request #8579: [MINOR] Added docs on gotchas when using PartialUpdateAvroPayload

2023-04-26 Thread via GitHub


hudi-bot commented on PR #8579:
URL: https://github.com/apache/hudi/pull/8579#issuecomment-1522918267

   
   ## CI report:
   
   * fa50b514ec994cde256ae1f85778648dc94e5ef6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #8505: [HUDI-6106] Spark offline compaction/Clustering Job will do clean like Flink job

2023-04-26 Thread via GitHub


hudi-bot commented on PR #8505:
URL: https://github.com/apache/hudi/pull/8505#issuecomment-1522917865

   
   ## CI report:
   
   * f7c73e83812258b53b979afbd6d465e9066b801f UNKNOWN
   * 269fad02a5346121e823a15c9804e2e63eb16c30 UNKNOWN
   * 442430f680316bdfefc27c4aca9f7cd94e95373c UNKNOWN
   *  Unknown: [CANCELED](TBD) 
   * 4dad96ba54827548c95059d12b7d5d5cdcc0c1a4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xushiyan closed issue #8109: [SUPPORT] Spark32PlusHoodieParquetFileFormat should set "SQLConf.LEGACY_PARQUET_NANOS_AS_LONG" ?

2023-04-26 Thread via GitHub


xushiyan closed issue #8109: [SUPPORT] Spark32PlusHoodieParquetFileFormat 
should set "SQLConf.LEGACY_PARQUET_NANOS_AS_LONG" ?
URL: https://github.com/apache/hudi/issues/8109





[GitHub] [hudi] ad1happy2go commented on issue #8502: [SUPPORT] Does spark.sql("MERGE INTO") supports schema evolution write option

2023-04-26 Thread via GitHub


ad1happy2go commented on issue #8502:
URL: https://github.com/apache/hudi/issues/8502#issuecomment-1522911733

   @jhchee The Spark SQL parser doesn't support this, so I'm not sure we can do anything on our end. All configs only come into play during the execution of the SQL.
   
   You can run an ALTER TABLE first to add the column before calling the MERGE.
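A sketch of the suggested workaround in Spark SQL; the table and column names (`orders`, `staged_orders`, `discount`) are hypothetical and only illustrate the "evolve first, then merge" ordering:

```sql
-- Evolve the schema explicitly before the merge.
ALTER TABLE orders ADD COLUMNS (discount DOUBLE);

-- The merge can then reference the new column.
MERGE INTO orders AS t
USING staged_orders AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.discount = s.discount
WHEN NOT MATCHED THEN INSERT *;
```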





[GitHub] [hudi] xushiyan closed issue #8316: [SUPPORT] INSERT operation performance vs UPSERT operation

2023-04-26 Thread via GitHub


xushiyan closed issue #8316: [SUPPORT] INSERT operation performance vs UPSERT 
operation
URL: https://github.com/apache/hudi/issues/8316





[GitHub] [hudi] zhuanshenbsj1 commented on pull request #8505: [HUDI-6106] Spark offline compaction/Clustering Job will do clean like Flink job

2023-04-26 Thread via GitHub


zhuanshenbsj1 commented on PR #8505:
URL: https://github.com/apache/hudi/pull/8505#issuecomment-1522901577

   @hudi-bot run azure




