[jira] [Updated] (HUDI-1392) lose partition info when using spark parameter "basePath"
[ https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] steven zhang updated HUDI-1392:
---
Description:
Reproduce the issue with the steps below:

set hoodie.datasource.write.hive_style_partitioning -> true

spark.read().format("org.apache.hudi").option("mergeSchema", true).option("basePath", tablePath).load(tablePath + (nonPartitionedTable ? "/*" : "/*")).createOrReplaceTempView(hudiTable);
spark.sql("select * from hudiTable where date>'20200807'").explain();

This prints PartitionFilters: [], i.e. no partition pruning happens. The reason is:

step 1. spark reads the datasource (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala L317):
case (dataSource: RelationProvider, None) => dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) // caseInsensitiveOptions is of CaseInsensitiveMap type

step 2. hudi creates the relation in org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext, optParams: Map[String, String], schema: StructType): BaseRelation. The incoming optParams is a CaseInsensitiveMap, but it is converted to a plain Map through Map ++:
val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)

step 3. hudi transforms it into a parquet relation. When we query a COW table it calls getBaseFileOnlyView(sqlContext, parameters, schema, readPaths, isBootstrappedTable, globPaths, metaClient), which creates a new DataSource and relation instance with:
DataSource.apply(sparkSession = sqlContext.sparkSession, paths = extraReadPaths, userSpecifiedSchema = Option(schema), className = "parquet", options = optParams).resolveRelation()

step 4. spark fetches basePath to infer the partition info (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala L196):
// the parameters come from DataSource#options (plain Map type)
parameters.get(BASE_PATH_PARAM)

So parameters.get(BASE_PATH_PARAM) calls Map#get, not CaseInsensitiveMap#get. The plain Map stores the lower-cased key "basepath", so get("basePath") returns None.

This is a Spark bug (fixed in version 3.0.1, https://issues.apache.org/jira/browse/SPARK-32368), but hudi currently uses Spark v2.4.4. In order to avoid this Spark issue, a simple solution is to not convert the type of the input optParams (Spark already made it a CaseInsensitiveMap) in org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext, optParams: Map[String, String]…

was:
Reproduce the issue with the same steps as above. The cause of this issue is that org.apache.hudi.DefaultSource#createRelation is called by dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala] L318), so the input optParams is of CaseInsensitiveMap type. Hudi attaches additional parameters via val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams), so the parameters type has been converted to a plain Map, not a CaseInsensitiveMap. The parquet datasource infers the partition info by fetching the basePath value through parameters.get(BASE_PATH_PARAM) ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala] L196), so the get method does not call CaseInsensitiveMap#get; it just calls Map#get("basePath") and returns None, which means no partition info is inferred. Spark 2.4.7 and above (https://issues.apache.org/jira/browse/SPARK-32364) already uses a CaseInsensitiveMap to fetch basePath, although its intention is not the same as this hudi issue, and lower Spark versions also have this issue. So we need to use val parameters = translateViewTypesToQueryTypes(optParams) ++ Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL), for two reasons: 1. lower Spark versions also have this issue; 2. otherwise the original type gets converted.

> lose partition info when using spark parameter "basePath"
> --
>
> Key: HUDI-1392
> URL: https://issues.apache.org/jira/browse/HUDI-1392
> Project: Apache Hudi
> Issue Type: Bug
> Components: Spark
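The type loss described in steps 2 and 4 can be demonstrated in a few lines of Scala. The sketch below uses a simplified stand-in class (CIMap, an invented name for this note) that mirrors the relevant behavior of Spark's org.apache.spark.sql.catalyst.util.CaseInsensitiveMap, under the Scala 2.11/2.12 collection semantics that Spark 2.4 builds against; it is illustrative only, not Hudi or Spark code:

```scala
import java.util.Locale

// Simplified stand-in for Spark's CaseInsensitiveMap: lookups lower-case the
// key, and `+` re-wraps, so folding with `+` (which is what `cim ++ other`
// does in Scala 2.11/2.12) keeps the case-insensitive behavior.
class CIMap[T](original: Map[String, T]) extends Map[String, T] {
  private val lower = original.map { case (k, v) => (k.toLowerCase(Locale.ROOT), v) }
  override def get(k: String): Option[T] = lower.get(k.toLowerCase(Locale.ROOT))
  override def iterator: Iterator[(String, T)] = lower.iterator
  override def +[B1 >: T](kv: (String, B1)): CIMap[B1] = new CIMap(lower + kv)
  override def -(k: String): Map[String, T] = new CIMap(lower - k.toLowerCase(Locale.ROOT))
}

object CaseLossDemo extends App {
  val optParams: Map[String, String] = new CIMap(Map("basePath" -> "/tmp/hudi_tbl"))

  // Plain Map on the left (the old hudi code): the result is an ordinary Map
  // built from CIMap's iterator, which hands over the lower-cased "basepath".
  val lost = Map("hoodie.datasource.query.type" -> "snapshot") ++ optParams
  println(lost.get("basePath")) // None -> partition inference finds no basePath

  // Case-insensitive map on the left (the proposed fix): ++ folds with
  // CIMap.+, so the wrapper and its case-insensitive get survive.
  val kept = optParams ++ Map("hoodie.datasource.query.type" -> "snapshot")
  println(kept.get("basePath")) // Some(/tmp/hudi_tbl)
}
```

Keeping the case-insensitive map on the left of ++ is why the merged fix (see the commit further below) drops the plain-Map prefix in DefaultSource and lets translateViewTypesToQueryTypes append the default to optParams instead.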
[GitHub] [hudi] quitozang closed issue #2274: [SUPPORT]
quitozang closed issue #2274: URL: https://github.com/apache/hudi/issues/2274
[GitHub] [hudi] shenh062326 commented on pull request #2222: [HUDI-1364] Add HoodieJavaEngineContext to hudi-java-client
shenh062326 commented on pull request #2222: URL: https://github.com/apache/hudi/pull/2222#issuecomment-733449190

> @shenh062326 are you planning to follow on with a full impl of a java based client? Changes LGTM.

Yes, I will add a full impl of a java based client.
[jira] [Created] (HUDI-1416) [Documentation] Documentation is confusing
Hemanga Borah created HUDI-1416:
---
Summary: [Documentation] Documentation is confusing
Key: HUDI-1416
URL: https://issues.apache.org/jira/browse/HUDI-1416
Project: Apache Hudi
Issue Type: Improvement
Components: Docs
Reporter: Hemanga Borah

Doc: [https://hudi.apache.org/docs/concepts.html#merge-on-read-table]

The doc says, "Merge on read table is a superset of copy on write, in the sense it still supports read optimized queries of the table by exposing only the base/columnar files in latest file slices." However, in the table above it (https://hudi.apache.org/docs/concepts.html#table-types--queries), it is mentioned that only "Merge On Read" supports "Read Optimized Queries". Another way of writing this would be: "Merge on read table is a superset of copy on write, in the sense that it *additionally* supports read optimized queries of the table by exposing only the base/columnar files in latest file slices."
[GitHub] [hudi] garyli1019 merged pull request #2243: HUDI-1392 lose partition info when using spark parameter basePath
garyli1019 merged pull request #2243: URL: https://github.com/apache/hudi/pull/2243
[hudi] branch master updated: [HUDI-1392] lose partition info when using spark parameter basePath (#2243)
This is an automated email from the ASF dual-hosted git repository.

garyli pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 56866a1  [HUDI-1392] lose partition info when using spark parameter basePath (#2243)
56866a1 is described below

commit 56866a11fe8b7a0ef8340f221da30c83c72b85da
Author: steven zhang
AuthorDate: Wed Nov 25 11:55:33 2020 +0800

    [HUDI-1392] lose partition info when using spark parameter basePath (#2243)

    Co-authored-by: zhang wen
---
 .../src/main/scala/org/apache/hudi/DataSourceOptions.scala    | 10 +++---
 hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala |  2 +-
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala b/hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
index fc52b38..73f70e7 100644
--- a/hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
+++ b/hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
@@ -81,9 +81,13 @@ object DataSourceReadOptions {
     val translation = Map(VIEW_TYPE_READ_OPTIMIZED_OPT_VAL -> QUERY_TYPE_SNAPSHOT_OPT_VAL,
       VIEW_TYPE_INCREMENTAL_OPT_VAL -> QUERY_TYPE_INCREMENTAL_OPT_VAL,
       VIEW_TYPE_REALTIME_OPT_VAL -> QUERY_TYPE_SNAPSHOT_OPT_VAL)
-    if (optParams.contains(VIEW_TYPE_OPT_KEY) && !optParams.contains(QUERY_TYPE_OPT_KEY)) {
-      log.warn(VIEW_TYPE_OPT_KEY + " is deprecated and will be removed in a later release. Please use " + QUERY_TYPE_OPT_KEY)
-      optParams ++ Map(QUERY_TYPE_OPT_KEY -> translation(optParams(VIEW_TYPE_OPT_KEY)))
+    if (!optParams.contains(QUERY_TYPE_OPT_KEY)) {
+      if (optParams.contains(VIEW_TYPE_OPT_KEY)) {
+        log.warn(VIEW_TYPE_OPT_KEY + " is deprecated and will be removed in a later release. Please use " + QUERY_TYPE_OPT_KEY)
+        optParams ++ Map(QUERY_TYPE_OPT_KEY -> translation(optParams(VIEW_TYPE_OPT_KEY)))
+      } else {
+        optParams ++ Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL)
+      }
     } else {
       optParams
     }
diff --git a/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala b/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
index 1cf9bdb..4a78378 100644
--- a/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
+++ b/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
@@ -55,7 +55,7 @@ class DefaultSource extends RelationProvider
     optParams: Map[String, String],
     schema: StructType): BaseRelation = {
     // Add default options for unspecified read options keys.
-    val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
+    val parameters = translateViewTypesToQueryTypes(optParams)
     val path = parameters.get("path")
     val readPathsStr = parameters.get(DataSourceReadOptions.READ_PATHS_OPT_KEY)
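With the commit above, DefaultSource no longer prefixes optParams with a plain Map, so the CaseInsensitiveMap that Spark hands in survives down to the parquet relation. As a hedged sketch of how one could check this (`spark` and `tablePath` are placeholders, e.g. a spark-shell session and the base path of the table from HUDI-1392), the ticket's reproduction can be re-run:

```scala
// Sketch of re-running the HUDI-1392 reproduction; `spark` and `tablePath`
// are placeholders, not a tested setup.
val df = spark.read.format("org.apache.hudi")
  .option("mergeSchema", "true")
  .option("basePath", tablePath) // key casing must now survive to PartitioningAwareFileIndex
  .load(tablePath + "/*")
df.createOrReplaceTempView("hudiTable")

// Before the fix the plan printed `PartitionFilters: []`; afterwards the
// filter on the hive-style partition column `date` should be pushed down.
spark.sql("select * from hudiTable where date > '20200807'").explain()
```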
[GitHub] [hudi] garyli1019 commented on pull request #2243: HUDI-1392 lose partition info when using spark parameter basePath
garyli1019 commented on pull request #2243: URL: https://github.com/apache/hudi/pull/2243#issuecomment-733445977

@yui2010 merging. Please assign the Jira ticket to yourself and close it. If you don't have contributor access yet, please send an email with your Jira ID to the dev mailing list and someone will add you to the project. Thanks!
[GitHub] [hudi] bithw1 edited a comment on issue #2276: [SUPPORT] java.lang.IllegalStateException: No Compaction request available
bithw1 edited a comment on issue #2276: URL: https://github.com/apache/hudi/issues/2276#issuecomment-733441100

The code that creates/upserts the table is as follows. I have explicitly specified the following two lines to disable compaction:

```
.option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "false")
.option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE_OPT_KEY, "false")
```

Not sure how I can exercise the compaction feature from code. Could you please help? @bvaradar, thanks!

```
package org.example.hudi

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieIndexConfig, HoodieWriteConfig}
import org.apache.hudi.index.HoodieIndex
import org.apache.spark.sql.{SaveMode, SparkSession}

case class MyOrder(name: String, price: String, creation_date: String, dt: String)

object MORWorkTest {

  val overwrite1Data = Seq(
    MyOrder("A", "1", "2020-11-18 14:43:32", "2020-11-19"),
    MyOrder("B", "1", "2020-11-18 14:42:21", "2020-11-19"),
    MyOrder("C", "1", "2020-11-18 14:47:19", "2020-11-19"),
    MyOrder("D", "1", "2020-11-18 14:46:50", "2020-11-19")
  )

  val insertUpdate1Data = Seq(
    MyOrder("A", "2", "2020-11-18 14:50:32", "2020-11-19"),
    MyOrder("B", "2", "2020-11-18 14:50:21", "2020-11-19"),
    MyOrder("C", "2", "2020-11-18 14:50:19", "2020-11-19"),
    MyOrder("D", "2", "2020-11-18 14:50:50", "2020-11-19")
  )

  val insertUpdate2Data = Seq(
    MyOrder("A", "3", "2020-11-18 14:53:32", "2020-11-19"),
    MyOrder("B", "3", "2020-11-18 14:52:21", "2020-11-19"),
    MyOrder("C", "3", "2020-11-18 14:57:19", "2020-11-19"),
    MyOrder("D", "3", "2020-11-18 14:56:50", "2020-11-19")
  )

  val spark = SparkSession.builder.appName("MORTest")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
    .enableHiveSupport().getOrCreate()

  val hudi_table = "hudi_hive_read_write_mor_5"
  val base_path = s"/data/hudi_demo/$hudi_table"

  def run(op: Int) = {
    val (data, saveMode) = op match {
      case 1 => (overwrite1Data, SaveMode.Overwrite)
      case 2 => (insertUpdate1Data, SaveMode.Append)
      case 3 => (insertUpdate2Data, SaveMode.Append)
    }
    import spark.implicits._
    val insertData = spark.createDataset(data)
    insertData.write.format("hudi")
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "creation_date")
      .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "xyz")
      .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, hudi_table)
      .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
      .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
      // table type: MOR
      .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
      // disable async compact
      .option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE_OPT_KEY, "false")
      .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 100)
      // disable inline compact
      .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "false")
      .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://10.41.90.208:1")
      .option(HoodieWriteConfig.TABLE_NAME, hudi_table)
      .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
      .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
      .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "dt")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      .mode(saveMode)
      .save(base_path)
  }

  def main(args: Array[String]): Unit = {
    // do overwrite
    run(1)
    // do upsert
    run(2)
    // do upsert
    run(3)
    println("===MOR is done=")
  }
}
```
[GitHub] [hudi] bithw1 commented on issue #2276: [SUPPORT] java.lang.IllegalStateException: No Compaction request available
bithw1 commented on issue #2276: URL: https://github.com/apache/hudi/issues/2276#issuecomment-733439132

Thanks @bvaradar. The files on hdfs are:

```
0     2020-11-22 10:00 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/.aux
0     2020-11-22 10:01 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/.temp
1596  2020-11-22 10:00 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100045.deltacommit
979   2020-11-22 10:00 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100045.deltacommit.inflight
0     2020-11-22 10:00 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100045.deltacommit.requested
1646  2020-11-22 10:01 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100057.deltacommit
1639  2020-11-22 10:00 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100057.deltacommit.inflight
0     2020-11-22 10:00 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100057.deltacommit.requested
1647  2020-11-22 10:01 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100101.deltacommit
1639  2020-11-22 10:01 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100101.deltacommit.inflight
0     2020-11-22 10:01 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/20201122100101.deltacommit.requested
0     2020-11-22 10:00 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/archived
339   2020-11-22 10:00 /data/hudi_demo/hudi_hive_read_write_mor_5/.hoodie/hoodie.properties
```

When I run compaction with any of the commit times (20201122100045, 20201122100057, 20201122100101), it complains: No Compaction request available
[jira] [Assigned] (HUDI-981) Use rocksDB as flink state backend
[ https://issues.apache.org/jira/browse/HUDI-981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu reassigned HUDI-981:
Assignee: chijunqing (was: wangxianghu)

> Use rocksDB as flink state backend
> --
>
> Key: HUDI-981
> URL: https://issues.apache.org/jira/browse/HUDI-981
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: wangxianghu
> Assignee: chijunqing
> Priority: Major
>
> Use rocksDB as flink state backend
[GitHub] [hudi] SteNicholas commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation
SteNicholas commented on pull request #2111: URL: https://github.com/apache/hudi/pull/2111#issuecomment-733426822

> @SteNicholas still interested in driving this forward?

@vinothchandar, yes, I have discussed it with @leesf offline. It will be completed this week.
[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.
asharma4-lucid commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-733323629

Yes, this is a COW table.
[GitHub] [hudi] bvaradar commented on issue #2277: [SUPPORT]
bvaradar commented on issue #2277: URL: https://github.com/apache/hudi/issues/2277#issuecomment-733305873

@umehrot2: Can you please take a look at this?
[GitHub] [hudi] bvaradar commented on issue #2276: [SUPPORT] java.lang.IllegalStateException: No Compaction request available
bvaradar commented on issue #2276: URL: https://github.com/apache/hudi/issues/2276#issuecomment-733304908

You can use hudi-cli and run "compactions show all" to list compactions and find the timestamp of one that is pending. Another option is to list the .hoodie folder and find all .compaction.requested files where there is no corresponding .commit file present. These are the pending compactions, which you can use to run compaction.
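A sketch of the second option in Scala, assuming the Hadoop client classes are on the classpath; basePath is a placeholder taken from this thread, and error handling is omitted:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: list .hoodie and keep every <instant>.compaction.requested that
// has no matching <instant>.commit -- those are the pending compactions.
object PendingCompactions extends App {
  val basePath = "/data/hudi_demo/hudi_hive_read_write_mor_5" // placeholder
  val fs = FileSystem.get(new Configuration())

  val names = fs.listStatus(new Path(basePath, ".hoodie")).map(_.getPath.getName)

  val pending = names
    .filter(_.endsWith(".compaction.requested"))
    .map(_.stripSuffix(".compaction.requested"))
    .filterNot(instant => names.contains(s"$instant.commit"))

  // Each remaining timestamp can be fed to a compaction run, e.g. via hudi-cli.
  pending.foreach(println)
}
```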
[GitHub] [hudi] bvaradar commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.
bvaradar commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-733299492

@asharma4-lucid: ~5hrs is way too much. Can you disable cleaning using the config hoodie.clean.automatic=false and try again? Is this a COW table?
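For reference, a minimal sketch of passing the suggested flag on a datasource write; `df` and `basePath` are placeholders, and the other required write options (record key, precombine field, table name) are elided:

```scala
// Sketch only: disable automatic cleaning after commits while testing.
df.write.format("hudi")
  .option("hoodie.clean.automatic", "false") // the flag suggested above
  .mode("append")
  .save(basePath)
```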
[GitHub] [hudi] vinothchandar commented on pull request #2208: [HUDI-1040] Make Hudi support Spark 3
vinothchandar commented on pull request #2208: URL: https://github.com/apache/hudi/pull/2208#issuecomment-733204015

@giaosudau that seems like a JVM crash. Not sure what in this PR could crash it. Do you have more diagnostic info?
[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.
asharma4-lucid commented on issue #2269: URL: https://github.com/apache/hudi/issues/2269#issuecomment-733174238

Thanks @bvaradar. I tried to insert just 5 records into the existing table with ~300K partitions and it took close to ~5 hrs. If I insert ~5 records into a new table it takes less than 2 mins. Is this extra time of ~5 hrs all because of the cleaner and compaction processes? For our use case, we mostly get inserts. With that in mind, would it be beneficial for us to switch from COW to MOR and do async compaction (I am most likely making an incorrect assumption that this huge extra processing time is only because of compaction)? And also, since our data does not have frequent record-level updates, would switching to MOR make any difference?
[GitHub] [hudi] wangxianghu commented on pull request #2271: [WIP][HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client
wangxianghu commented on pull request #2271: URL: https://github.com/apache/hudi/pull/2271#issuecomment-733026404

blocked by https://github.com/apache/hudi/pull/2278
[GitHub] [hudi] wangxianghu removed a comment on pull request #2271: [WIP][HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client
wangxianghu removed a comment on pull request #2271: URL: https://github.com/apache/hudi/pull/2271#issuecomment-733023377

blocked by https://github.com/apache/hudi/pull/2278
[GitHub] [hudi] wangxianghu commented on pull request #2271: [WIP][HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client
wangxianghu commented on pull request #2271: URL: https://github.com/apache/hudi/pull/2271#issuecomment-733023377

blocked by https://github.com/apache/hudi/pull/2278
[GitHub] [hudi] wangxianghu commented on pull request #2278: [HUDI-1412] Make HoodieWriteConfig support setting different default …
wangxianghu commented on pull request #2278: URL: https://github.com/apache/hudi/pull/2278#issuecomment-733022202

@yanghua please take a look when you are free
[GitHub] [hudi] codecov-io commented on pull request #2278: [HUDI-1412] Make HoodieWriteConfig support setting different default …
codecov-io commented on pull request #2278: URL: https://github.com/apache/hudi/pull/2278#issuecomment-733020702

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2278?src=pr=h1) Report
> Merging [#2278](https://codecov.io/gh/apache/hudi/pull/2278?src=pr=desc) (12b85dc) into [master](https://codecov.io/gh/apache/hudi/commit/0ebef1c0a0e4b96616ee7e4372d3b9f0eb83a919?el=desc) (0ebef1c) will **decrease** coverage by `43.14%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2278/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2278?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2278       +/-   ##
============================================
- Coverage   53.55%   10.41%   -43.15%
+ Complexity   2774       48     -2726
============================================
  Files         348       50      -298
  Lines       16115     1777    -14338
  Branches     1640      211     -1429
============================================
- Hits         8631      185     -8446
+ Misses       6785     1579     -5206
+ Partials      699       13      -686
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudispark | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.41% <ø> (-59.66%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2278?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2278/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| ... and [324
[jira] [Updated] (HUDI-1412) Make HoodieWriteConfig support setting different default value according to engine type
[ https://issues.apache.org/jira/browse/HUDI-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1412:
---
Labels: pull-request-available (was: )

> Make HoodieWriteConfig support setting different default value according to engine type
> ---
>
> Key: HUDI-1412
> URL: https://issues.apache.org/jira/browse/HUDI-1412
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: wangxianghu
> Assignee: wangxianghu
> Priority: Major
> Labels: pull-request-available
>
> Currently, `HoodieIndexConfig` sets its default index type to bloom, which suits the spark engine.
> But, since hoodie now supports the flink engine, the default values should be set according to the engine the user uses.
[GitHub] [hudi] wangxianghu opened a new pull request #2278: [HUDI-1412] Make HoodieWriteConfig support setting different default …
wangxianghu opened a new pull request #2278: URL: https://github.com/apache/hudi/pull/2278

…value according to engine type

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*Make HoodieWriteConfig support setting different default values according to engine type*

## Brief change log

Currently, `HoodieIndexConfig` sets its default index type to bloom, which suits the spark engine. But, since hoodie now supports the flink engine, the default values should be set according to the engine the user uses.

## Verify this pull request

This pull request is already covered by existing tests, such as *org.apache.hudi.config.TestHoodieWriteConfig#testDefaultIndexAccordingToEngineType*.

## Committer checklist

 - [ ] Has a corresponding JIRA in PR title & commit
 - [ ] Commit message is descriptive of the change
 - [ ] CI is green
 - [ ] Necessary doc changes done or have another open PR
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
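The diff itself is not included in this digest. As a rough illustration of the idea only (the names `EngineType` and `defaultIndexType` are invented here, not necessarily the PR's actual API, and the non-Spark fallback is an assumption), an engine-aware default could look like:

```scala
// Illustrative sketch: EngineType and defaultIndexType are invented names.
object EngineType extends Enumeration {
  val SPARK, FLINK, JAVA = Value
}

// Pick an engine-aware default instead of hard-coding bloom for every engine.
def defaultIndexType(engine: EngineType.Value): String = engine match {
  case EngineType.SPARK                   => "BLOOM"    // suits the spark engine, per the ticket
  case EngineType.FLINK | EngineType.JAVA => "INMEMORY" // assumed fallback for other engines
}

// e.g. defaultIndexType(EngineType.FLINK) would yield "INMEMORY"
```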
[jira] [Updated] (HUDI-1412) Make HoodieWriteConfig support setting different default value according to engine type
[ https://issues.apache.org/jira/browse/HUDI-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu updated HUDI-1412:
--
Description: Currently, `HoodieIndexConfig` sets its default index type to bloom, which suits the spark engine. But, since hoodie now supports the flink engine, the default values should be set according to the engine the user uses.

> Make HoodieWriteConfig support setting different default value according to engine type
> ---
>
> Key: HUDI-1412
> URL: https://issues.apache.org/jira/browse/HUDI-1412
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: wangxianghu
> Assignee: wangxianghu
> Priority: Major
>
> Currently, `HoodieIndexConfig` sets its default index type to bloom, which suits the spark engine.
> But, since hoodie now supports the flink engine, the default values should be set according to the engine the user uses.
[jira] [Updated] (HUDI-1412) Make HoodieWriteConfig support setting different default value according to engine type
[ https://issues.apache.org/jira/browse/HUDI-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu updated HUDI-1412:
--
Summary: Make HoodieWriteConfig support setting different default value according to engine type (was: Make HoodieConfig support setting different default value according to engine type)

> Make HoodieWriteConfig support setting different default value according to engine type
> ---
>
> Key: HUDI-1412
> URL: https://issues.apache.org/jira/browse/HUDI-1412
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: wangxianghu
> Assignee: wangxianghu
> Priority: Major
[GitHub] [hudi] codecov-io edited a comment on pull request #2216: [HUDI-1357] Added a check to ensure no records are lost during updates.
codecov-io edited a comment on pull request #2216: URL: https://github.com/apache/hudi/pull/2216#issuecomment-729776111

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2216?src=pr=h1) Report
> Merging [#2216](https://codecov.io/gh/apache/hudi/pull/2216?src=pr=desc) (c8f05c9) into [master](https://codecov.io/gh/apache/hudi/commit/6310a2307abba94c7ff8a770f45462deae2c312e?el=desc) (6310a23) will **decrease** coverage by `43.26%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2216/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2216?src=pr=tree)

```diff
@@             Coverage Diff              @@
##           master    #2216       +/-   ##
============================================
- Coverage   53.67%   10.41%   -43.27%
+ Complexity   2849       48     -2801
============================================
  Files         359       50      -309
  Lines       16565     1777    -14788
  Branches     1782      211     -1571
============================================
- Hits         8892      185     -8707
+ Misses       6916     1579     -5337
+ Partials      757       13      -744
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudispark | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `10.41% <ø> (-59.69%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2216?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2216/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| ... and
[jira] [Commented] (HUDI-1414) HoodieInputFormat support for bucketed partitions
[ https://issues.apache.org/jira/browse/HUDI-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238081#comment-17238081 ] linshan-ma commented on HUDI-1414:
--
I'm interested in this ticket. I want to try it.

> HoodieInputFormat support for bucketed partitions
> -
>
> Key: HUDI-1414
> URL: https://issues.apache.org/jira/browse/HUDI-1414
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Presto Integration
> Reporter: Satish Kotha
> Priority: Major
> Fix For: 0.8.0
>
> When querying a hoodie partition through presto, we get following error:
> {code}
> Presto error: {u'errorCode': 13, u'message': u'Presto cannot read bucketed partition in an input format with UseFileSplitsFromInputFormat annotation: HoodieInputFormat', u'errorType': u'USER_ERROR', u'failureInfo': {u'suppressed': [], u'message': u'Presto cannot read bucketed partition in an input format with UseFileSplitsFromInputFormat annotation: HoodieInputFormat', u'type': u'com.facebook.presto.spi.PrestoException', u'stack':
> [u'com.facebook.presto.hive.BackgroundHiveSplitLoader.lambda$loadPartition$5(BackgroundHiveSplitLoader.java:432)',
> u'com.facebook.presto.hive.authentication.UserGroupInformationUtils.lambda$executeActionInDoAs$0(UserGroupInformationUtils.java:29)',
> u'java.base/java.security.AccessController.doPrivileged(Native Method)',
> u'java.base/javax.security.auth.Subject.doAs(Subject.java:361)',
> u'org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1816)',
> u'com.facebook.presto.hive.authentication.UserGroupInformationUtils.executeActionInDoAs(UserGroupInformationUtils.java:27)',
> u'com.facebook.presto.hive.authentication.ImpersonatingHdfsAuthentication.doAs(ImpersonatingHdfsAuthentication.java:39)',
> u'com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:430)',
> u'com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:330)',
> u'com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:116)',
> u'com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:259)',
> u'com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)',
> u'com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)',
> u'com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)',
> u'com.facebook.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)',
> u'java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)',
> u'java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)',
> u'java.base/java.lang.Thread.run(Thread.java:834)']}, u'errorName': u'NOT_SUPPORTED'}
> {code}
> Figure out how to add support for bucketed partitions.
[jira] [Assigned] (HUDI-1414) HoodieInputFormat support for bucketed partitions
[ https://issues.apache.org/jira/browse/HUDI-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] linshan-ma reassigned HUDI-1414:
Assignee: linshan-ma

> HoodieInputFormat support for bucketed partitions
> -
>
> Key: HUDI-1414
> URL: https://issues.apache.org/jira/browse/HUDI-1414
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Presto Integration
> Reporter: Satish Kotha
> Assignee: linshan-ma
> Priority: Major
> Fix For: 0.8.0
>
> When querying a hoodie partition through presto, we get following error:
> {code}
> Presto error: {u'errorCode': 13, u'message': u'Presto cannot read bucketed partition in an input format with UseFileSplitsFromInputFormat annotation: HoodieInputFormat', u'errorType': u'USER_ERROR', u'failureInfo': {u'suppressed': [], u'message': u'Presto cannot read bucketed partition in an input format with UseFileSplitsFromInputFormat annotation: HoodieInputFormat', u'type': u'com.facebook.presto.spi.PrestoException', u'stack':
> [u'com.facebook.presto.hive.BackgroundHiveSplitLoader.lambda$loadPartition$5(BackgroundHiveSplitLoader.java:432)',
> u'com.facebook.presto.hive.authentication.UserGroupInformationUtils.lambda$executeActionInDoAs$0(UserGroupInformationUtils.java:29)',
> u'java.base/java.security.AccessController.doPrivileged(Native Method)',
> u'java.base/javax.security.auth.Subject.doAs(Subject.java:361)',
> u'org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1816)',
> u'com.facebook.presto.hive.authentication.UserGroupInformationUtils.executeActionInDoAs(UserGroupInformationUtils.java:27)',
> u'com.facebook.presto.hive.authentication.ImpersonatingHdfsAuthentication.doAs(ImpersonatingHdfsAuthentication.java:39)',
> u'com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:430)',
> u'com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:330)',
> u'com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:116)',
> u'com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:259)',
> u'com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)',
> u'com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)',
> u'com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)',
> u'com.facebook.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)',
> u'java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)',
> u'java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)',
> u'java.base/java.lang.Thread.run(Thread.java:834)']}, u'errorName': u'NOT_SUPPORTED'}
> {code}
> Figure out how to add support for bucketed partitions.
[GitHub] [hudi] liujinhui1994 commented on a change in pull request #2242: [HUDI-1366] Make deltasteamer support exporting data from hdfs to hudi
liujinhui1994 commented on a change in pull request #2242: URL: https://github.com/apache/hudi/pull/2242#discussion_r52946

## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java

## @@ -522,14 +523,18 @@ public static void main(String[] args) throws Exception {
    */
   private transient DeltaSync deltaSync;

+  private final HoodieDeltaStreamerConfig deltaStreamerConfig;
+
   public DeltaSyncService(Config cfg, JavaSparkContext jssc, FileSystem fs, Configuration conf,
                           Option properties) throws IOException {
+    this.props = properties.get();
     this.cfg = cfg;
     this.jssc = jssc;
     this.sparkSession = SparkSession.builder().config(jssc.getConf()).getOrCreate();
     this.asyncCompactService = Option.empty();
+    this.deltaStreamerConfig = new HoodieDeltaStreamerConfig(props);

-    if (fs.exists(new Path(cfg.targetBasePath))) {
+    if (fs.exists(new Path(cfg.targetBasePath)) && !deltaStreamerConfig.getFullOverwrite()) {

Review comment:
       The parameter itself only acts on DFSSource, so is the command-line tool appropriate?
[GitHub] [hudi] santas-little-helper-13 opened a new issue #2277: [SUPPORT]
santas-little-helper-13 opened a new issue #2277: URL: https://github.com/apache/hudi/issues/2277

Hi, I am working with hudi in AWS Glue and I have a problem with hudi updates. I have one Glue job that inserts data into hudi parquet files: it reads data from a glue table, does some processing, gets the max ID_key from the already existing data and adds it to the row number so that ID_key is unique across the whole table. Now I have another Glue job in which I read from that hudi table:

`hudiDF = spark.read.format("hudi").load('s3://prct-parquet-tgt/test_task1' + "/*")`

limit it to just one record, and make changes in one column and in the column upd_ind, which is the precombine field (all records have 0 by default as upd_ind):

`updateDF = hudiDF.limit(1).withColumn('sequence', lit('new_value')).withColumn('upd_ind', lit(1))`

then I define the hudi options:

```
hoodie_write_options = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.parquet.compression.codec': 'snappy',
    'hoodie.table.name': 'test_task1',
    'hoodie.datasource.write.recordkey.field': 'ID_key',
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.datasource.write.table.name': 'test_task1',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'upd_ind',
    'hoodie.datasource.write.insert.drop.duplicates': True,
    'hoodie.datasource.write.partitionpath.field': "datehour",
    'hoodie.upsert.shuffle.parallelism': 8,
    'hoodie.insert.shuffle.parallelism': 8,
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.parquet.small.file.limit': 0
}
```

and write the updated row:

`updateDF.write.format('hudi').options(**hoodie_write_options).mode('append').save('s3://prct-parquet-tgt/test_task1')`

The problem is that the record that gets updated is random and has no connection to the record that is shown in the Glue job. If I specify a particular record, then no update is done at all:

`updateDF = hudiDF.filter(col('ID_key')==64777).withColumn('sequence', lit('new_value')).withColumn('upd_ind', lit(1))`

I need to update the exact record that I specify. Please help.