[jira] [Updated] (HUDI-1297) [Umbrella] Spark Datasource Support
[ https://issues.apache.org/jira/browse/HUDI-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1297:
-----------------------------
    Priority: Blocker  (was: Critical)

> [Umbrella] Spark Datasource Support
>
>                 Key: HUDI-1297
>                 URL: https://issues.apache.org/jira/browse/HUDI-1297
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: spark
>    Affects Versions: 0.9.0
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Blocker
>              Labels: hudi-umbrellas
>             Fix For: 0.11.0
>
> Yet to be fully scoped out, but at a high level we want:
> * First-class support for streaming reads/writes via Structured Streaming
> * Row-based readers/writers all the way
> * Support for file/partition pruning using Hudi metadata tables

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[GitHub] [hudi] xushiyan commented on issue #4411: [SUPPORT] - Presto Querying Issue in AWS EMR 6.3.1
xushiyan commented on issue #4411: URL: https://github.com/apache/hudi/issues/4411#issuecomment-1017200245

@rajgowtham24 I think you need `NonpartitionedKeyGenerator` instead for a non-partitioned table. `default/` is created when the complex key generator fails to extract the partition path properly. With `default/` in the directory, your table is unexpectedly "partitioned" now, while your `hive_sync.partition_extractor_class` is set to `NonPartitionedExtractor`; that mismatch is probably causing the read issue.

> @xushiyan - when we are using ComplexKeyGenerator in 0.5.0, by default Hudi is creating the data under the default partition (the same happens with 0.8.0 as well). So to query the table through Presto 0.230 we have to modify the table location to s3://bucket_name/test/table_name/default.
>
> For a partitioned table, we are able to query the table from Presto 0.230 without updating the table location.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
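For a non-partitioned table, the key generator and the Hive sync partition extractor need to agree, as the comment above explains. A sketch of the consistent option pairing (option keys as used by the Hudi datasource; double-check against your Hudi version's configuration docs):

```properties
# Non-partitioned table: the key generator, the (empty) partition path field,
# and the hive-sync extractor must all be the non-partitioned variants.
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
```

With this pairing no `default/` directory should be created, so the table location can stay at the base path rather than being pointed at a `default/` subdirectory.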
[GitHub] [hudi] hudi-bot removed a comment on pull request #4647: [HUDI-3283] Bootstrap support overwrite existing table
hudi-bot removed a comment on pull request #4647: URL: https://github.com/apache/hudi/pull/4647#issuecomment-1017198356 ## CI report: * 6e9b2bfa50f349bc68e32f890a20df2da4b9f708 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4647: [HUDI-3283] Bootstrap support overwrite existing table
hudi-bot commented on pull request #4647: URL: https://github.com/apache/hudi/pull/4647#issuecomment-1017200201 ## CI report: * 6e9b2bfa50f349bc68e32f890a20df2da4b9f708 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5364) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] BruceKellan opened a new issue #4648: [SUPPORT] Upgrade Hudi to 0.10.1 from 0.10.0 using spark
BruceKellan opened a new issue #4648: URL: https://github.com/apache/hudi/issues/4648

**Describe the problem you faced**

I am using Spark 3 + Hudi 0.10.0. When I upgrade Hudi to 0.10.1-rc2, I get this:

java.io.InvalidClassException: org.apache.hudi.common.table.timeline.HoodieActiveTimeline; local class incompatible: stream classdesc serialVersionUID = -1280891512509140081, local class serialVersionUID = 1642514781003501811
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1941)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1807)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2098)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2343)
    at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:594)
    at org.apache.hudi.common.table.HoodieTableMetaClient.readObject(HoodieTableMetaClient.java:151)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2234)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2125)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1624)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2343)

**Environment Description**
* Hudi version : 0.10.1-rc2 using b670801afc110870f354766c872f386d18261add
* Spark version : 3.1.2
* Hadoop version : 2.8.5
* Storage (HDFS/S3/GCS..) : Aliyun OSS
* Running on Docker? (yes/no) : no
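The InvalidClassException above is Java's standard serialization compatibility check: the serialVersionUID baked into previously serialized state no longer matches the class on the new classpath. A minimal, self-contained sketch of the mechanism (the class, field, and UID value here are illustrative, not Hudi's actual code):

```java
import java.io.*;

public class SerialVersionDemo {
    // A class with an explicit serialVersionUID. As long as this value (and a
    // compatible field layout) is kept across releases, old serialized bytes
    // remain readable. The UID here is an arbitrary illustrative value.
    static class Timeline implements Serializable {
        private static final long serialVersionUID = 1L;
        final String lastInstant;
        Timeline(String lastInstant) { this.lastInstant = lastInstant; }
    }

    // Serialize a Timeline and read it straight back.
    static String roundTrip(String instant) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new Timeline(instant));
        }
        // This succeeds because the stream's serialVersionUID matches the local
        // class. If the class had no stable serialVersionUID and changed between
        // versions (as between two Hudi releases), the UIDs would differ and
        // readObject() would throw java.io.InvalidClassException, like the one
        // in the report above.
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return ((Timeline) ois.readObject()).lastInstant;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("20220119110215185"));
    }
}
```

In practice this means old serialized state (e.g. a long-running job's checkpointed objects) from 0.10.0 cannot be deserialized by the 0.10.1 class whose computed UID changed.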
[GitHub] [hudi] hudi-bot commented on pull request #4647: [HUDI-3283] Bootstrap support overwrite existing table
hudi-bot commented on pull request #4647: URL: https://github.com/apache/hudi/pull/4647#issuecomment-1017198356 ## CI report: * 6e9b2bfa50f349bc68e32f890a20df2da4b9f708 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-3283) Bootstrap support overwrite existing table
[ https://issues.apache.org/jira/browse/HUDI-3283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3283: - Labels: pull-request-available (was: ) > Bootstrap support overwrite existing table > -- > > Key: HUDI-3283 > URL: https://issues.apache.org/jira/browse/HUDI-3283 > Project: Apache Hudi > Issue Type: Task > Components: bootstrap >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major > Labels: pull-request-available
[GitHub] [hudi] wangxianghu opened a new pull request #4647: [HUDI-3283] Bootstrap support overwrite existing table
wangxianghu opened a new pull request #4647: URL: https://github.com/apache/hudi/pull/4647 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] Add support for using database name in incremental query
dongkelun commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r788447754 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -76,6 +76,11 @@ public static final String HOODIE_PROPERTIES_FILE = "hoodie.properties"; public static final String HOODIE_PROPERTIES_FILE_BACKUP = "hoodie.properties.backup"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty + .key("hoodie.database.name") + .noDefaultValue() + .withDocumentation("Database name that will be used for incremental query."); Review comment: Although Spark SQL will use it, at present it is only used for incremental queries. So do we need to change this here?
[jira] [Assigned] (HUDI-3283) Bootstrap support overwrite existing table
[ https://issues.apache.org/jira/browse/HUDI-3283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianghu Wang reassigned HUDI-3283: -- Assignee: Xianghu Wang > Bootstrap support overwrite existing table > -- > > Key: HUDI-3283 > URL: https://issues.apache.org/jira/browse/HUDI-3283 > Project: Apache Hudi > Issue Type: Task > Components: bootstrap >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major
[jira] [Created] (HUDI-3283) Bootstrap support overwrite existing table
Xianghu Wang created HUDI-3283: -- Summary: Bootstrap support overwrite existing table Key: HUDI-3283 URL: https://issues.apache.org/jira/browse/HUDI-3283 Project: Apache Hudi Issue Type: Task Components: bootstrap Reporter: Xianghu Wang
[GitHub] [hudi] xushiyan commented on a change in pull request #4083: [HUDI-2837] Add support for using database name in incremental query
xushiyan commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r788428757 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java ## @@ -53,25 +57,27 @@ public static final Logger LOG = LogManager.getLogger(InputPathHandler.class); private final Configuration conf; - // tablename to metadata mapping for all Hoodie tables(both incremental & snapshot) + // tableName to metadata mapping for all Hoodie tables(both incremental & snapshot) private final Map tableMetaClientMap; private final Map> groupedIncrementalPaths; private final List snapshotPaths; private final List nonHoodieInputPaths; + private boolean isIncrementalUseDatabase; - public InputPathHandler(Configuration conf, Path[] inputPaths, List incrementalTables) throws IOException { + public InputPathHandler(Configuration conf, Path[] inputPaths, List incrementalTables, JobConf job) throws IOException { this.conf = conf; tableMetaClientMap = new HashMap<>(); snapshotPaths = new ArrayList<>(); nonHoodieInputPaths = new ArrayList<>(); groupedIncrementalPaths = new HashMap<>(); +this.isIncrementalUseDatabase = HoodieHiveUtils.isIncrementalUseDatabase(Job.getInstance(job)); Review comment: `JobConf` is passed in just to compute a boolean config? can we somehow make the config extracted from `Configuration conf` ? 
always prefer to have less args ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java ## @@ -117,9 +124,11 @@ private void parseInputPaths(Path[] inputPaths, List incrementalTables) } } - private void tagAsIncrementalOrSnapshot(Path inputPath, String tableName, - HoodieTableMetaClient metaClient, List incrementalTables) { -if (!incrementalTables.contains(tableName)) { + private void tagAsIncrementalOrSnapshot(Path inputPath, HoodieTableMetaClient metaClient, List incrementalTables) { +String databaseName = metaClient.getTableConfig().getDatabaseName(); +String tableName = metaClient.getTableConfig().getTableName(); +if ((isIncrementalUseDatabase && !StringUtils.isNullOrEmpty(databaseName) && !incrementalTables.contains(databaseName + "." + tableName)) +|| (!(isIncrementalUseDatabase && !StringUtils.isNullOrEmpty(databaseName)) && !incrementalTables.contains(tableName))) { Review comment: this condition check is pretty hard to read.. can we improve this? ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java ## @@ -95,19 +101,20 @@ private void parseInputPaths(Path[] inputPaths, List incrementalTables) // We already know the base path for this inputPath. basePathKnown = true; // Check if this is for a snapshot query - String tableName = metaClient.getTableConfig().getTableName(); - tagAsIncrementalOrSnapshot(inputPath, tableName, metaClient, incrementalTables); + tagAsIncrementalOrSnapshot(inputPath, metaClient, incrementalTables); break; } } if (!basePathKnown) { -// This path is for a table that we dont know about yet. +// This path is for a table that we don't know about yet. 
HoodieTableMetaClient metaClient; try { metaClient = getTableMetaClientForBasePath(inputPath.getFileSystem(conf), inputPath); + String databaseName = metaClient.getTableConfig().getDatabaseName(); String tableName = metaClient.getTableConfig().getTableName(); - tableMetaClientMap.put(tableName, metaClient); - tagAsIncrementalOrSnapshot(inputPath, tableName, metaClient, incrementalTables); + tableMetaClientMap.put(isIncrementalUseDatabase && !StringUtils.isNullOrEmpty(databaseName) + ? databaseName + "." + tableName : tableName, metaClient); Review comment: can we move the table name creation logic into a helper method? ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -76,6 +76,11 @@ public static final String HOODIE_PROPERTIES_FILE = "hoodie.properties"; public static final String HOODIE_PROPERTIES_FILE_BACKUP = "hoodie.properties.backup"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty + .key("hoodie.database.name") + .noDefaultValue() + .withDocumentation("Database name that will be used for incremental query."); Review comment: this is not just for incremental query now, right? spark sql also uses it. Better to add more info here to explain the use cases of this config, which will show up in the website for users to understand.
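One way to address the readability comments above is to centralize the database-qualified name logic in a single helper, so both the `tableMetaClientMap` key and the incremental/snapshot check read as one call. A minimal sketch, not the actual patch (`resolveTableKey` is a hypothetical helper name):

```java
public class IncrementalTableKey {
    // Hypothetical helper: prefix the table name with the database only when
    // the "use database" flag is on AND a database name is actually set.
    public static String resolveTableKey(boolean isIncrementalUseDatabase,
                                         String databaseName, String tableName) {
        boolean useDb = isIncrementalUseDatabase
                && databaseName != null && !databaseName.isEmpty();
        return useDb ? databaseName + "." + tableName : tableName;
    }

    public static void main(String[] args) {
        System.out.println(resolveTableKey(true, "test_hudi", "t1"));  // test_hudi.t1
        System.out.println(resolveTableKey(false, "test_hudi", "t1")); // t1
        System.out.println(resolveTableKey(true, "", "t1"));           // t1
    }
}
```

The hard-to-read condition in `tagAsIncrementalOrSnapshot` then collapses to `!incrementalTables.contains(resolveTableKey(...))`, and the map key and tag logic cannot drift apart.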
[GitHub] [hudi] hudi-bot removed a comment on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot removed a comment on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017147751 ## CI report: * 1cd9e56e69b15edac5a12afdb9626c853eb6e83f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5343) * 942072ece4f6fc46251f922eabe4f6bdac767410 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5362) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017178601 ## CI report: * 942072ece4f6fc46251f922eabe4f6bdac767410 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5362) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Updated] (HUDI-2837) The original hoodie.table.name should be maintained in Spark SQL
[ https://issues.apache.org/jira/browse/HUDI-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2837:
-----------------------------
    Epic Link: HUDI-1658

> The original hoodie.table.name should be maintained in Spark SQL
>
>                 Key: HUDI-2837
>                 URL: https://issues.apache.org/jira/browse/HUDI-2837
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark
>            Reporter: 董可伦
>            Assignee: 董可伦
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When querying Hudi incrementally in Hive, we set the start query time of the table. This setting applies to every table with that name, not only the table in the current database. In practice, table names cannot be guaranteed to be unique across databases, so this can be handled by setting hoodie.table.name to database name + table name. However, at present the original value of hoodie.table.name is not kept consistent in Spark SQL.
[jira] [Updated] (HUDI-2837) The original hoodie.table.name should be maintained in Spark SQL
[ https://issues.apache.org/jira/browse/HUDI-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2837:
-----------------------------
    Reviewers: Raymond Xu, Yann Byron  (was: Raymond Xu, Yann Byron)

> The original hoodie.table.name should be maintained in Spark SQL
>
>                 Key: HUDI-2837
>                 URL: https://issues.apache.org/jira/browse/HUDI-2837
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: spark
>            Reporter: 董可伦
>            Assignee: 董可伦
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> When querying Hudi incrementally in Hive, we set the start query time of the table. This setting applies to every table with that name, not only the table in the current database. In practice, table names cannot be guaranteed to be unique across databases, so this can be handled by setting hoodie.table.name to database name + table name. However, at present the original value of hoodie.table.name is not kept consistent in Spark SQL.
[jira] [Updated] (HUDI-3282) Fix delete exception for Spark SQL when sync Hive
[ https://issues.apache.org/jira/browse/HUDI-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

董可伦 updated HUDI-3282:
----------------------
    Description:

```
hudi 0.11.0 master build
spark: 2.4.5
```

```bash
hive
create database test_hudi;
```

```scala
spark-shell --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 --executor-cores 2 --driver-memory 4G --driver-cores 2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --principal .. --keytab ..

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.QuickstartUtils.{DataGenerator, convertToStringList, getQuickstartWriteConfigs}
import org.apache.hudi.config.HoodieWriteConfig.TBL_NAME
import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.SimpleKeyGenerator
import org.apache.hudi.common.model.{DefaultHoodieRecordPayload, HoodiePayloadProps}
import org.apache.hudi.io.HoodieMergeHandle
import org.apache.hudi.common.table.HoodieTableConfig
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, "a1", 10, 1000, "2022-01-19")).toDF("id", "name", "value", "ts", "dt")
df.write.format("hudi").
  option(HoodieWriteConfig.TBL_NAME.key, "test_hudi_table_sync_hive").
  option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL).
  option(RECORDKEY_FIELD.key, "id").
  option(PRECOMBINE_FIELD.key, "ts").
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
  option("hoodie.datasource.write.partitionpath.field", "").
  option("hoodie.metadata.enable", false).
  option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator").
  option(META_SYNC_ENABLED.key(), true).
  option(HIVE_USE_JDBC.key(), false).
  option(HIVE_DATABASE.key(), "test_hudi").
  option(HIVE_AUTO_CREATE_DATABASE.key(), true).
  option(HIVE_TABLE.key(), "test_hudi_table_sync_hive").
  option(HIVE_PARTITION_EXTRACTOR_CLASS.key(), "org.apache.hudi.hive.MultiPartKeysValueExtractor").
  mode("overwrite").
  save("/test_hudi/test_hudi_table_sync_hive")
```

```
# hoodie.properties
hoodie.table.precombine.field=ts
hoodie.table.partition.fields=
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.populate.meta.fields=true
hoodie.timeline.layout.version=1
hoodie.table.version=3
hoodie.table.recordkey.fields=id
hoodie.table.base.file.format=PARQUET
hoodie.table.timeline.timezone=LOCAL
hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.table.name=test_hudi_table_sync_hive
hoodie.datasource.write.hive_style_partitioning=false
```

hive:
```sql
show create table test_hudi_table_sync_hive;
```

```
| createtab_stmt |
| CREATE EXTERNAL TABLE `test_hudi_table_sync_hive`( |
| `_hoodie_commit_time` string, |
| `_hoodie_commit_seqno` string, |
| `_hoodie_record_key` string, |
| `_hoodie_partition_path` string, |
| `_hoodie_file_name` string, |
| `id` int, |
| `name` string, |
| `value` int, |
| `ts` int, |
| `dt` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
| WITH SERDEPROPERTIES ( |
| 'hoodie.query.as.ro.table'='false', |
| 'path'='/test_hudi/test_hudi_table_sync_hive') |
| STORED AS INPUTFORMAT |
| 'org.apache.hudi.hadoop.HoodieParquetInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION |
| 'hdfs://cluster1/test_hudi/test_hudi_table_sync_hive' |
| TBLPROPERTIES ( |
| 'last_commit_time_sync'='20220119110215185', |
| 'spark.sql.sources.provider'='hudi', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'spark.sql.sources.schema.part.0'='\{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\
[jira] [Updated] (HUDI-3282) Fix delete exception for Spark SQL when sync Hive
[ https://issues.apache.org/jira/browse/HUDI-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 董可伦 updated HUDI-3282: -- Description: {{hudi 0.11.0 master build spark: 2.4.5}} hive create database test_hudi; spark-shell --master yarn --deploy-mode client --executor-memory 2G --num-executors 3 --executor-cores 2 --driver-memory 4G --driver-cores 2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --principal .. --keytab .. import org.apache.hudi.DataSourceWriteOptions._ import org.apache.hudi.QuickstartUtils.{DataGenerator, convertToStringList, getQuickstartWriteConfigs} import org.apache.hudi.config.HoodieWriteConfig.TBL_NAME import org.apache.spark.sql.SaveMode._ import org.apache.spark.sql.{SaveMode, SparkSession} import org.apache.spark.sql.functions.lit import org.apache.hudi.DataSourceReadOptions._ import org.apache.hudi.config.HoodieWriteConfig import org.apache.hudi.keygen.SimpleKeyGenerator import org.apache.hudi.common.model.{DefaultHoodieRecordPayload, HoodiePayloadProps} import org.apache.hudi.io.HoodieMergeHandle import org.apache.hudi.common.table.HoodieTableConfig import org.apache.spark.sql.functions._ import spark.implicits._ val df = Seq((1, "a1", 10, 1000, "2022-01-19")).toDF("id", "name", "value", "ts", "dt") df.write.format("hudi"). option(HoodieWriteConfig.TBL_NAME.key, "test_hudi_table_sync_hive"). option(TABLE_TYPE.key, COW_TABLE_TYPE_OPT_VAL). option(RECORDKEY_FIELD.key, "id"). option(PRECOMBINE_FIELD.key, "ts"). option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.NonpartitionedKeyGenerator"). option("hoodie.datasource.write.partitionpath.field", ""). option("hoodie.metadata.enable", false). option(KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator"). option(META_SYNC_ENABLED.key(), true). option(HIVE_USE_JDBC.key(), false). option(HIVE_DATABASE.key(), "test_hudi"). 
option(HIVE_AUTO_CREATE_DATABASE.key(), true). option(HIVE_TABLE.key(), "test_hudi_table_sync_hive"). option(HIVE_PARTITION_EXTRACTOR_CLASS.key(), "org.apache.hudi.hive.MultiPartKeysValueExtractor"). mode("overwrite"). save("/test_hudi/test_hudi_table_sync_hive") {{# hoodie.properties hoodie.table.precombine.field=ts hoodie.table.partition.fields= hoodie.table.type=COPY_ON_WRITE hoodie.archivelog.folder=archived hoodie.populate.meta.fields=true hoodie.timeline.layout.version=1 hoodie.table.version=3 hoodie.table.recordkey.fields=id hoodie.table.base.file.format=PARQUET hoodie.table.timeline.timezone=LOCAL hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator hoodie.table.name=test_hudi_table_sync_hive hoodie.datasource.write.hive_style_partitioning=false}} hive show create table test_hudi_table_sync_hive; ++ | createtab_stmt | ++ | CREATE EXTERNAL TABLE `test_hudi_table_sync_hive`( | | `_hoodie_commit_time` string, | | `_hoodie_commit_seqno` string, | | `_hoodie_record_key` string, | | `_hoodie_partition_path` string, | | `_hoodie_file_name` string, | | `id` int, | | `name` string, | | `value` int, | | `ts` int, | | `dt` string) | | ROW FORMAT SERDE | | 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' | | WITH SERDEPROPERTIES ( | | 'hoodie.query.as.ro.table'='false', | | 'path'='/test_hudi/test_hudi_table_sync_hive') | | STORED AS INPUTFORMAT | | 'org.apache.hudi.hadoop.HoodieParquetInputFormat' | | OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' | | LOCATION | | 'hdfs://cluster1/test_hudi/test_hudi_table_sync_hive' | | TBLPROPERTIES ( | | 'last_commit_time_sync'='20220119110215185', | | 'spark.sql.sources.provider'='hudi', | | 'spark.sql.sources.schema.numParts'='1', | | 
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[\{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},\{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},\{"name":"id","type":"integer","nullable":false,"metadata":{}},\{"name":"name","type":"string","nullable":true,"metadata":{}},\{"name":"value","type":"integer","nullable":false,"metadata":{}},\{"name":"ts","type":"integer","nullable":false,"metadata":{}},\{"name":"dt","type":"string","nullable":true,"metadata":{}}]}', | | 'transient_lastDdlTime'='1642561355') | ++ 28 rows selected (0.429 seconds) spark-sql --master yarn --deploy-mode client --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spar
[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test
[ https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3088: - Story Points: 0.5 > Make Spark 3 the default profile for build and test > --- > > Key: HUDI-3088 > URL: https://issues.apache.org/jira/browse/HUDI-3088 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Blocker > Fix For: 0.11.0 > > > By default, when people check out the code, they should have activated spark > 3 for the repo. Also all tests should be running against the latest supported > spark version. Correspondingly the default scala version becomes 2.12 and the > default parquet version 1.12.
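Making a profile the default in Maven is done via `activeByDefault`. A hedged sketch of what that looks like in a `pom.xml` (the profile id `spark3` and the exact property names/versions are illustrative; Hudi's actual pom may differ, beyond the Scala 2.12 and Parquet 1.12 defaults the ticket names):

```xml
<!-- Illustrative pom.xml fragment: the spark3 profile activates by default
     unless another profile is explicitly selected with -P. -->
<profiles>
  <profile>
    <id>spark3</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <scala.binary.version>2.12</scala.binary.version>
      <parquet.version>1.12.2</parquet.version>
    </properties>
  </profile>
</profiles>
```

Note that `activeByDefault` is deactivated whenever any other profile in the same pom is activated on the command line, which is the usual mechanism for letting `-P`/`-D` flags override the Spark 3 default.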
[jira] [Updated] (HUDI-3161) Add Call Produce Command for spark sql
[ https://issues.apache.org/jira/browse/HUDI-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3161:
-----------------------------
    Epic Link: HUDI-1658

> Add Call Produce Command for spark sql
>
>                 Key: HUDI-3161
>                 URL: https://issues.apache.org/jira/browse/HUDI-3161
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: spark-sql
>            Reporter: Forward Xu
>            Assignee: Forward Xu
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
> example
> {code:java}
> # Produce1
> call show_commits_metadata(table => 'test_hudi_table');
> commit_time action partition file_id previous_commit num_writes num_inserts num_deletes num_update_writes total_errors total_log_blocks total_corrupt_logblocks total_rollback_blocks total_log_records total_updated_records_compacted total_bytes_written
> 20220109225319449 commit dt=2021-05-03 d0073a12-085d-4f49-83e9-402947e7e90a-0 null 1 1 0 0 0 0 0 0 0 0 435349
> 20220109225311742 commit dt=2021-05-02 b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 20220109214830592 1 1 0 0 0 0 0 0 0 0 435340
> 20220109225301429 commit dt=2021-05-01 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 20220109214830592 1 1 0 0 0 0 0 0 0 0 435340
> 20220109214830592 commit dt=2021-05-01 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 20220109191631015 0 0 1 0 0 0 0 0 0 0 432653
> 20220109214830592 commit dt=2021-05-02 b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 20220109191648181 0 0 1 0 0 0 0 0 0 0 432653
> 20220109191648181 commit dt=2021-05-02 b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0 null 1 1 0 0 0 0 0 0 0 0 435341
> 20220109191631015 commit dt=2021-05-01 0d7298b3-6b55-4cff-8d7d-b0772358b78a-0 null 1 1 0 0 0 0 0 0 0 0 435341
> Time taken: 0.844 seconds, Fetched 7 row(s)
> # Produce2
> call rollback_to_instant(table => 'test_hudi_table', instant_time => '20220109225319449');
> rollback_result
> true
> Time taken: 5.038 seconds, Fetched 1 row(s)
> {code}
[GitHub] [hudi] hudi-bot commented on pull request #4646: [HUDI-3250] Upgrade Presto docker image
hudi-bot commented on pull request #4646: URL: https://github.com/apache/hudi/pull/4646#issuecomment-1017157186 ## CI report: * 097d027ebdd6cdb3246dc6dbe4aba8f00122e5e6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5363) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4646: [HUDI-3250] Upgrade Presto docker image
hudi-bot commented on pull request #4646: URL: https://github.com/apache/hudi/pull/4646#issuecomment-1017155666 ## CI report: * 097d027ebdd6cdb3246dc6dbe4aba8f00122e5e6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
dongkelun commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017155450 To avoid confusion and ambiguity, I have reverted the branch to the historical commit [071b13b](https://github.com/apache/hudi/commit/071b13bfe3ebee1875d12432e550c8718566bfd4) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
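The comment above describes rolling a PR branch back to a historical commit. A minimal sketch of that git workflow, using a throwaway sandbox repository rather than the actual Hudi branch:

```shell
# Build a scratch repo with two commits, then move the branch back to the first.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "first"
first=$(git rev-parse HEAD)
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "second"

# Discard "second" by resetting the branch pointer to the historical commit.
git reset --hard -q "$first"
```

On a shared PR branch this also requires rewriting the remote history afterwards (e.g. `git push --force-with-lease`), which is why the author flags it explicitly to avoid confusion for reviewers.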
[jira] [Updated] (HUDI-3250) Upgrade Presto version in docker setup and integ test
[ https://issues.apache.org/jira/browse/HUDI-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3250: - Labels: pull-request-available (was: ) > Upgrade Presto version in docker setup and integ test > - > > Key: HUDI-3250 > URL: https://issues.apache.org/jira/browse/HUDI-3250 > Project: Apache Hudi > Issue Type: Test > Components: trino-presto >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] codope opened a new pull request #4646: [HUDI-3250] Upgrade Presto docker image
codope opened a new pull request #4646: URL: https://github.com/apache/hudi/pull/4646 ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017147751 ## CI report: * 1cd9e56e69b15edac5a12afdb9626c853eb6e83f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5343) * 942072ece4f6fc46251f922eabe4f6bdac767410 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5362) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017146397 ## CI report: * 1cd9e56e69b15edac5a12afdb9626c853eb6e83f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5343) * 942072ece4f6fc46251f922eabe4f6bdac767410 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #3745: [HUDI-2514] Add default hiveTableSerdeProperties for Spark SQL when sync Hive
hudi-bot removed a comment on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1016217784 ## CI report: * 1cd9e56e69b15edac5a12afdb9626c853eb6e83f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5343) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated (a08a2b7 -> 31b57a2)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from a08a2b7 [MINOR] Add instructions to build and upload Docker Demo images (#4612)
add  31b57a2 [HUDI-3236] use fields'comments persisted in catalog to fill in schema (#4587)

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/catalog/HoodieCatalogTable.scala      | 31 +--
 .../spark/sql/hudi/HoodieSqlCommonUtils.scala          | 13 +++-
 .../AlterHoodieTableChangeColumnCommand.scala          | 16 ++
 .../AlterHoodieTableDropPartitionCommand.scala         |  2 +-
 .../command/ShowHoodieTablePartitionsCommand.scala     |  4 +--
 .../org/apache/spark/sql/hudi/TestAlterTable.scala     | 36 +-
 6 files changed, 75 insertions(+), 27 deletions(-)
[jira] [Closed] (HUDI-3236) ALTER TABLE COMMENT old comment gets reverted
[ https://issues.apache.org/jira/browse/HUDI-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3236. Reviewers: Raymond Xu, Tao Meng (was: Raymond Xu) Resolution: Fixed

> ALTER TABLE COMMENT old comment gets reverted
> ---------------------------------------------
>
> Key: HUDI-3236
> URL: https://issues.apache.org/jira/browse/HUDI-3236
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark-sql
> Affects Versions: 0.10.1
> Reporter: Raymond Xu
> Assignee: Yann Byron
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
>
> Original Estimate: 0.5h
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> {code:sql}
> create table if not exists cow_nonpt_nonpcf_tbl (
>   id int,
>   name string,
>   price double
> ) using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> );
> insert into cow_nonpt_nonpcf_tbl select 1, 'a1', 20;
> ALTER TABLE cow_nonpt_nonpcf_tbl alter column id comment "primary id";
> DESC cow_nonpt_nonpcf_tbl;
> -- this works fine so far
> ALTER TABLE cow_nonpt_nonpcf_tbl alter column name comment "name column";
> DESC cow_nonpt_nonpcf_tbl;
> -- this saves the comment for name column
> -- but comment for id column was reverted back to NULL
> {code}
> reported while testing on 0.10.1-rc1 (spark 3.0.3, 3.1.2)
-- This message was sent by Atlassian Jira (v8.20.1#820001)
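The fix merged for this issue (per the PR title, "use fields' comments persisted in catalog to fill in schema") amounts to back-filling column comments from the catalog whenever an incoming schema omits them. The actual change lives in Scala (`HoodieCatalogTable.scala`); the sketch below is an illustrative Python rendering of the idea, with hypothetical names:

```python
def fill_comments(new_schema, catalog_schema):
    """For every field whose comment is missing in new_schema,
    reuse the comment already persisted in the catalog."""
    catalog_comments = {f["name"]: f.get("comment") for f in catalog_schema}
    filled = []
    for field in new_schema:
        field = dict(field)  # avoid mutating the caller's field dicts
        if field.get("comment") is None:
            field["comment"] = catalog_comments.get(field["name"])
        filled.append(field)
    return filled

# The bug scenario from the SQL above: commenting on `name` must not
# revert the earlier comment on `id`.
catalog = [{"name": "id", "comment": "primary id"}, {"name": "name"}]
incoming = [{"name": "id"}, {"name": "name", "comment": "name column"}]
result = fill_comments(incoming, catalog)
```

With this merge step, altering one column's comment leaves every previously saved comment intact.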
[GitHub] [hudi] xushiyan merged pull request #4587: [HUDI-3236] use fields'comments persisted in catalog to fill in schema
xushiyan merged pull request #4587: URL: https://github.com/apache/hudi/pull/4587 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] harishraju-govindaraju edited a comment on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working
harishraju-govindaraju edited a comment on issue #4641: URL: https://github.com/apache/hudi/issues/4641#issuecomment-1017138248 Tried to define a proper schema; still hitting the same error. Any help is much appreciated, as we are planning to use DeltaStreamer in production. Caused by: org.apache.hudi.exception.HoodieIOException: Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false') at [Source: (String)"Objavro.schema�{"type":"record","name":"topLevelRecord","fields":[{"name":"id","type":["string","null"]},{"name":"creation_date","type":["string","null"]},{"name":"last_update_time","type":["string","null"]},{"name":"quantity","type":["string","null"]},{"name":"compcode","type":["string","null"]}]}0org.apache.spark.version"; line: 1, column: 11] -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot removed a comment on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017101009 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319) * a9254d5c5059e77f883467f05dedf08d704b17f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5359) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot commented on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017133273 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * a9254d5c5059e77f883467f05dedf08d704b17f1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5359) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot commented on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017128365 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5358) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot removed a comment on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017099833 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5358) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] harishraju-govindaraju edited a comment on issue #4641: [SUPPORT] - HudiDeltaStreamer - EMR - SparkSubmit Not working
harishraju-govindaraju edited a comment on issue #4641: URL: https://github.com/apache/hudi/issues/4641#issuecomment-1017122913 Hello @nsivabalan , Thanks for promptly responding to my question. I tried to clear the folder and reran the below spark-submit command. The folder .hoodie got created but the job ended with error with no data files. **_Unrecognized token 'Objavro': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false') at [Source: (String)"Objavro.schema�{"type":"record","name":"topLevelRecord","fields":[{"name":"id","type":["string","null"]},{"name":"creation_date","type":["string","null"]},{"name":"last_update_time","type":["string","null"]},{"name":"quantity","type":["string","null"]},{"name":"compcode","type":["string","null"]}]}0org.apache.spark.version"; line: 1, column: 11]_** spark-submit \ --jars "s3://zcustomjar/spark-avro_2.11-2.4.4.jar" \ --deploy-mode "client" \ --class "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer" /usr/lib/hudi/hudi-utilities-bundle.jar \ --schemaprovider-class "org.apache.hudi.utilities.schema.FilebasedSchemaProvider" \ --table-type COPY_ON_WRITE \ --source-ordering-field id \ --target-base-path s3://ztrusted1/default/hudi-table1/ --target-table hudi-table1 \ --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \ --hoodie-conf hoodie.datasource.write.recordkey.field=id \ --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://zlanding1/input1/ \ --hoodie-conf hoodie.datasource.write.partitionpath.field=compcode \ --hoodie-conf hoodie.datasource.write.operation=insert \ --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3://zcustomjar/source2.avsc \ --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=s3://zcustomjar/target.avsc \ I have manually created the schema .avsc file using notepad. Not sure if that is a problem. 
{
  "type": "record",
  "name": "triprec",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "creation_date", "type": "string" },
    { "name": "last_update_time", "type": "string" },
    { "name": "quantity", "type": "string" },
    { "name": "compcode", "type": "string" }
  ]
}
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
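Writing a `.avsc` file by hand in Notepad is fine as long as the result is valid JSON, since Avro schemas are plain JSON documents. A quick sanity check of the schema from this thread:

```python
import json

# The hand-written triprec schema from the comment above.
AVSC = """
{ "type": "record", "name": "triprec", "fields": [
  {"name": "id", "type": "string"},
  {"name": "creation_date", "type": "string"},
  {"name": "last_update_time", "type": "string"},
  {"name": "quantity", "type": "string"},
  {"name": "compcode", "type": "string"}
]}
"""

# json.loads raises an error if the file is malformed (e.g. smart quotes
# or a BOM sneaked in by the editor), which rules the schema file out
# as the cause of a parse failure.
schema = json.loads(AVSC)
field_names = [f["name"] for f in schema["fields"]]
```

This only validates JSON well-formedness, not full Avro-schema semantics, but it catches the most common hand-editing mistakes.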
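The `Unrecognized token 'Objavro'` error in this thread is a strong hint that the input files are Avro container files being fed to a JSON parser: every Avro object container file starts with the magic bytes `Obj` followed by the version byte 0x01, which a JSON reader surfaces as the token `Objavro…`. A small sketch for checking what a source file actually contains (the function name and file paths are illustrative, not part of the reported setup):

```python
import os
import tempfile

def looks_like_avro(path: str) -> bool:
    """Avro object container files begin with the magic bytes b'Obj\\x01'."""
    with open(path, "rb") as f:
        return f.read(4) == b"Obj\x01"

# A fake Avro header versus a plain JSON document, in a temp directory.
workdir = tempfile.mkdtemp()
avro_path = os.path.join(workdir, "sample.avro")
json_path = os.path.join(workdir, "sample.json")
with open(avro_path, "wb") as f:
    f.write(b"Obj\x01" + b"\x00" * 16)  # header only; the payload is irrelevant here
with open(json_path, "w") as f:
    f.write('{"id": "1"}')
```

If the files under the DFS root turn out to be Avro, the DeltaStreamer source class should match that format (e.g. via `--source-class`) rather than a JSON source, since a JSON source is what produces exactly this parse error.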
[GitHub] [hudi] LucassLin commented on issue #4642: [SUPPORT] Hudi Merge Into
LucassLin commented on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017119391 > ```scala > historicalDF.write.format("hudi").saveAsTable("tableName") > ``` > > Sorry, I didn't see this one just now. It's OK from Hudi version 0.9.0, because there's still hudi sparkSql > > In addition, it is also possible to configure sync hive, but there are bugs in previous versions. See this PR for details [3745](https://github.com/apache/hudi/pull/3745) > > Of course, you can also create a hive table. As long as the attributes are completely consistent, it is essentially the same as sync hive thanks for the replies. I tried using hudi createTable sql command but getting ``` Exception = MetaException(message:Got exception: java.io.IOException Error accessing gs://* ``` I also tried using saveAsTable but seems like there might be some issue with hive config which causes ``` Exception = Invalid host name: local host is: ``` I will try to resolve these once I get back to work and see if the entity table would solve the mergeInto issue. Thanks again for your help. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
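For reference, the Spark SQL `MERGE INTO` shape being attempted in this thread looks roughly like the following; table and column names are illustrative, and the target must be a Hudi table registered in the catalog, which is exactly what the `saveAsTable`/Hive-sync discussion above is about:

```sql
MERGE INTO target_tbl AS t
USING (SELECT id, name, price, ts FROM updates_tbl) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

This is a sketch only; the exact supported syntax depends on the Hudi and Spark versions in use.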
[GitHub] [hudi] RocMarshal closed pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.
RocMarshal closed pull request #3813: URL: https://github.com/apache/hudi/pull/3813 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] prashantwason commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field
prashantwason commented on a change in pull request #4449: URL: https://github.com/apache/hudi/pull/4449#discussion_r788345996

File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java

  .withDocumentation("Lower values increase the size of metadata tracked within HFile, but can offer potentially " + "faster lookup times.");

  public static final ConfigProperty<String> HFILE_SCHEMA_KEY_FIELD_NAME = ConfigProperty

Review comment: This setting is broken because the HFileReader does not have a way to use it. Assume I specify this setting to be "someotherkey". The HFileReader will still use the hardcoded "key". I suggest you remove this setting and all associated code and defer this for a later PR, which will plug this setting into the reader.

File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java

  public void writeAvro(String recordKey, IndexedRecord object) throws IOException {
    byte[] value = HoodieAvroUtils.avroToBytes((GenericRecord) object);
    if (schemaRecordKeyField.isPresent()) {
      GenericRecord recordKeyExcludedRecord = HoodieAvroUtils.bytesToAvro(value, this.schema);

Review comment: This will reduce performance, as you are converting the record to bytes in the line above and then immediately parsing it back into a GenericRecord. It may be better to check first, before creating the bytes.

File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileConfig.java

  private final Configuration hadoopConf;
  private final BloomFilter bloomFilter;
  private final KeyValue.KVComparator hfileComparator;
  private final String schemaKeyFieldId;

Review comment: Why is this an Id and not a name? schemaKeyFieldName

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
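The second review point above (serializing the record to bytes and then immediately parsing it back) is a general serialize-once concern. The writer under review is Java; the toy Python sketch below, with hypothetical function names and JSON standing in for Avro, only illustrates the reviewer's suggested ordering:

```python
import json

def write_record_slow(record, key_field):
    # The anti-pattern flagged in the review: serialize, then parse the
    # bytes straight back just to drop one field.
    data = json.dumps(record)        # record -> serialized form
    parsed = json.loads(data)        # ...and immediately back again
    parsed.pop(key_field, None)
    return json.dumps(parsed)        # serialized a second time

def write_record_fast(record, key_field):
    # The suggested ordering: transform first, serialize exactly once.
    trimmed = {k: v for k, v in record.items() if k != key_field}
    return json.dumps(trimmed)

rec = {"key": "k1", "value": 42}
```

Both functions produce the same output; the second simply avoids one serialization and one deserialization per record, which matters on a per-record write path.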
[GitHub] [hudi] hudi-bot removed a comment on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot removed a comment on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017091703 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5357) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot commented on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017113200 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5357) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun edited a comment on issue #4642: [SUPPORT] Hudi Merge Into
dongkelun edited a comment on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017091320 @LucassLin In the official documentation: [https://hudi.apache.org/docs/quick-start-guide/](https://hudi.apache.org/docs/quick-start-guide/), under "Create Table" (Spark SQL):
```sql
-- create a mor non-partitioned table without preCombineField provided
create table hudi_mor_tbl (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
);
```
The document is slightly wrong in this place: `type = 'cow'` should be `type = 'mor'`; this parameter only controls the table type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot commented on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017109276 ## CI report: * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot removed a comment on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017071243 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on issue #4642: [SUPPORT] Hudi Merge Into
dongkelun commented on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017108838 ```scala historicalDF.write.format("hudi").saveAsTable("tableName") ``` Sorry, I didn't see this one just now. This works from Hudi version 0.9.0 onward, because that release includes Hudi Spark SQL support. In addition, you can also configure Hive sync, but there were bugs in earlier versions; see PR [3745](https://github.com/apache/hudi/pull/3745) for details. Of course, you can also create a Hive table yourself: as long as its attributes are completely consistent, it is essentially the same as Hive sync.
[hudi] branch master updated: [MINOR] Add instructions to build and upload Docker Demo images (#4612)
This is an automated email from the ASF dual-hosted git repository. codope pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new a08a2b7 [MINOR] Add instructions to build and upload Docker Demo images (#4612) a08a2b7 is described below commit a08a2b730674d57a0c40545023b21f0bac34e7df Author: Y Ethan Guo AuthorDate: Wed Jan 19 20:25:28 2022 -0800 [MINOR] Add instructions to build and upload Docker Demo images (#4612) * [MINOR] Add instructions to build and upload Docker Demo images * Add local test instruction --- docker/README.md | 93 ++ docker/push_to_docker_hub.png | Bin 0 -> 260126 bytes docker/setup_demo.sh | 6 ++- 3 files changed, 98 insertions(+), 1 deletion(-) diff --git a/docker/README.md b/docker/README.md new file mode 100644 index 000..19293de --- /dev/null +++ b/docker/README.md @@ -0,0 +1,93 @@ + + +# Docker Demo for Hudi + +This repo contains the docker demo resources for building docker demo images, setting up the demo, and running Hudi in the +docker demo environment. + +## Repo Organization + +### Configs for assembling docker images - `/hoodie` + +The `/hoodie` folder contains all the configs for assembling necessary docker images. The name and repository of each +docker image, e.g., `apachehudi/hudi-hadoop_2.8.4-trinobase_368`, is defined in the maven configuration file `pom.xml`. + +### Docker compose config for the Demo - `/compose` + +The `/compose` folder contains the yaml file to compose the Docker environment for running Hudi Demo. + +### Resources and Sample Data for the Demo - `/demo` + +The `/demo` folder contains useful resources and sample data used for the Demo. 
+ +## Build and Test Image locally + +To build all docker images locally, you can run the script: + +```shell +./build_local_docker_images.sh +``` + +To build a single image target, you can run + +```shell +mvn clean pre-integration-test -DskipTests -Ddocker.compose.skip=true -Ddocker.build.skip=false -pl :<image-module> -am +# For example, to build hudi-hadoop-trinobase-docker +mvn clean pre-integration-test -DskipTests -Ddocker.compose.skip=true -Ddocker.build.skip=false -pl :hudi-hadoop-trinobase-docker -am +``` + +Alternatively, you can use the `docker` cli directly under `hoodie/hadoop`. Note that you need to manually name your local +image using the `-t` option to match the naming in the `pom.xml`, so that you can update the corresponding image +repository in Docker Hub (detailed steps in the next section). + +```shell +# Run under hoodie/hadoop; the <tag> is optional, "latest" by default +docker build <image> -t <repo>/<image>[:<tag>] +# For example, to build trinobase +docker build trinobase -t apachehudi/hudi-hadoop_2.8.4-trinobase_368 +``` + +After new images are built, you can run the following script to bring up the docker demo with your local images: + +```shell +./setup_demo.sh dev +``` + +## Upload Updated Image to Repository on Docker Hub + +Once you have built the updated image locally, you can push the image to its corresponding repository in the Docker +Hub registry, designated by its name or tag: + +```shell +docker push <repo>/<image>:<tag> +# For example +docker push apachehudi/hudi-hadoop_2.8.4-trinobase_368 +``` + +You can also easily push the image to Docker Hub using the Docker Desktop app: go to `Images`, search for the image by +name, and then click on the three dots and `Push to Hub`. + +![Push to Docker Hub](push_to_docker_hub.png) + +Note that you need to ask for permission to upload the Hudi Docker Demo images to the repositories. + +You can find more information in the [Docker Hub Repositories Manual](https://docs.docker.com/docker-hub/repos/). 
+ +## Docker Demo Setup + +Please refer to the [Docker Demo Docs page](https://hudi.apache.org/docs/docker_demo). \ No newline at end of file diff --git a/docker/push_to_docker_hub.png b/docker/push_to_docker_hub.png new file mode 100644 index 000..faa431b Binary files /dev/null and b/docker/push_to_docker_hub.png differ diff --git a/docker/setup_demo.sh b/docker/setup_demo.sh index 634fe9e..9f0a100 100755 --- a/docker/setup_demo.sh +++ b/docker/setup_demo.sh @@ -17,10 +17,14 @@ # limitations under the License. SCRIPT_PATH=$(cd `dirname $0`; pwd) +HUDI_DEMO_ENV=$1 WS_ROOT=`dirname $SCRIPT_PATH` # restart cluster HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml down -HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml pull +if [ "$HUDI_DEMO_ENV" != "dev" ]; then + echo "Pulling docker demo images ..." + HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml pull +fi sleep 5 HUDI_WS=${WS_ROOT} docker-compose -f ${SCRIPT_PATH}/compose/docker-compose_hadoop284_hive233_spark244.yml up -d sleep 15
[GitHub] [hudi] codope merged pull request #4612: [MINOR] Add instructions to build and upload Docker Demo images
codope merged pull request #4612: URL: https://github.com/apache/hudi/pull/4612
[GitHub] [hudi] hudi-bot commented on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot commented on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017101009 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319) * a9254d5c5059e77f883467f05dedf08d704b17f1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5359)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot removed a comment on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017093744 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319) * a9254d5c5059e77f883467f05dedf08d704b17f1 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot commented on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017099833 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5358)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot removed a comment on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017098255 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
hudi-bot commented on pull request #4645: URL: https://github.com/apache/hudi/pull/4645#issuecomment-1017098255 ## CI report: * 5332458bfb61a6e13b9b59ae3813d236f86e01da UNKNOWN
[GitHub] [hudi] watermelon12138 opened a new pull request #4645: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target …
watermelon12138 opened a new pull request #4645: URL: https://github.com/apache/hudi/pull/4645 ## What is the purpose of the pull request Enable MultiTableDeltaStreamer to update a single target table from multiple source tables. ## Brief change log - *Modify HoodieMultiTableDeltaStreamer so that it generates the per-table execution context based on the source tables.* - *Modify DeltaSync.java so that the source table can be associated with other tables and each source can configure an independent checkpoint.* - *Add unit tests.*
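The PR's core idea — one target table fed by several sources, each tracking its own checkpoint — can be illustrated with a toy model (pure Python, no Hudi; all names here are hypothetical and only mirror the described semantics, not the actual DeltaStreamer code):

```python
# Toy model of a multi-source -> single-target sync: each source keeps an
# independent checkpoint, and every sync round appends only that source's
# records past its own checkpoint to the shared target.
class MultiSourceSync:
    def __init__(self):
        self.checkpoints = {}   # source name -> last offset consumed
        self.target = []        # shared target "table"

    def sync(self, source, records):
        start = self.checkpoints.get(source, 0)
        new = records[start:]             # records past this source's checkpoint
        self.target.extend(new)
        self.checkpoints[source] = len(records)
        return len(new)

sync = MultiSourceSync()
sync.sync("orders", ["o1", "o2"])
sync.sync("refunds", ["r1"])
sync.sync("orders", ["o1", "o2", "o3"])   # only "o3" is new for this source
```

The point of the per-source checkpoint is visible in the last call: replaying the full "orders" feed appends only the unseen record, while "refunds" keeps its own offset untouched.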
[jira] [Updated] (HUDI-3279) Metadata table stores incorrect file sizes after Restore
[ https://issues.apache.org/jira/browse/HUDI-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3279: -- Attachment: Screen Shot 2022-01-19 at 7.56.37 PM.png > Metadata table stores incorrect file sizes after Restore > > > Key: HUDI-3279 > URL: https://issues.apache.org/jira/browse/HUDI-3279 > Project: Apache Hudi > Issue Type: Task >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > Attachments: Screen Shot 2022-01-19 at 12.17.21 PM.png, Screen Shot > 2022-01-19 at 12.18.27 PM.png, Screen Shot 2022-01-19 at 7.56.37 PM.png > > > While working on [https://github.com/apache/hudi/pull/4556], I have stumbled > upon an issue of the LogBlock Scanner EOF-ing on the log-files in tests after > performing a Restore operation. > The root cause turned out to be the Metadata Table storing incorrect > sizes of the files after Restore (sizes in MT are essentially 2x of what is > in FS): > !Screen Shot 2022-01-19 at 12.17.21 PM.png! > !Screen Shot 2022-01-19 at 12.18.27 PM.png! > > This seems to occur due to the following: > # The Metadata table treats new Records for the same file as "deltas", appending > the file-size to its records > (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java#L227) > # Upon Restore (which is handled simply as a collection of Rollbacks) we > pick the *max* of the sizes of the files before and after the operation, without > regard to which instant we're actually rolling back to > (https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java#L254). > > *Proposal* > Instead of simply always picking the max size, we should pick the size of the > file as it was right before. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
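The failure mode described in the ticket can be reproduced with a toy model of the payload merge (pure Python, no Hudi; this only mirrors the semantics described above, not the actual HoodieMetadataPayload code):

```python
# Toy model: metadata-table records for a file are combined by summing
# "delta" sizes, so replaying a write after a restore double-counts it.
def combine(prev_size, delta):
    # mirrors the "append file-size as delta" behavior described in point 1
    return prev_size + delta

size = combine(0, 100)        # initial write: file is 100 bytes
size = combine(size, 100)     # same write replayed after restore
assert size == 200            # 2x the real on-disk size -- the reported bug

# Rollback merge: the current code keeps max(before, after); the proposal
# is to keep the size the file had right before the rolled-back operation.
def rollback_merge_current(before, after):
    return max(before, after)

def rollback_merge_proposed(before, after):
    return before
```

Under the proposed merge, rolling back the replayed write would restore the true 100-byte size instead of retaining the inflated 200.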
[jira] [Commented] (HUDI-3279) Metadata table stores incorrect file sizes after Restore
[ https://issues.apache.org/jira/browse/HUDI-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479086#comment-17479086 ] Alexey Kudinkin commented on HUDI-3279: --- This is an example of a test failing in CI: [https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=5351&view=logs&j=7601efb9-4019-552e-11ba-eb31b66593b2&t=9688f101-287d-53f4-2a80-87202516f5d0&l=4344] !Screen Shot 2022-01-19 at 7.56.37 PM.png! > Metadata table stores incorrect file sizes after Restore
[GitHub] [hudi] stayrascal commented on a change in pull request #4141: [HUDI-2815] Support partial update for streaming change logs
stayrascal commented on a change in pull request #4141: URL: https://github.com/apache/hudi/pull/4141#discussion_r788331396 ## File path: hudi-common/src/main/java/org/apache/hudi/common/model/PartialUpdateWithLatestAvroPayload.java ## @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.model; + +import org.apache.hudi.common.util.Option; + +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.generic.IndexedRecord; + +import java.io.IOException; +import java.util.List; +import java.util.Objects; +import java.util.Properties; + +import static org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro; + +/** + * The only difference with {@link DefaultHoodieRecordPayload} is that support update partial fields + * in latest record which value is not null to existing record instead of all fields. + * + * Assuming a {@link GenericRecord} has three fields: a int , b int, c int. The first record value: 1, 2, 3. + * The second record value is: 4, 5, null, the field c value is null. 
After call the combineAndGetUpdateValue method, + * we will get final record value: 4, 5, 3, field c value will not be overwritten because its value is null in latest record. + */ +public class PartialUpdateWithLatestAvroPayload extends DefaultHoodieRecordPayload { + Review comment: Hi @danny0405, if I understand correctly, the `preCombine` method will be called while deduplicating records via `flushBucket` or `flushRemaining` in `StreamWriteFunction`, which only deduplicates the (newly created/updated) records in the buffer. If we only override `preCombine`, it will only update/merge the records in the buffer; but if a record with the same recordKey already exists in a base/log file, that record will be overwritten by the new merged record from the buffer, right? For example, in COW mode we might still need to override the `combineAndGetUpdateValue` method, because it is called by `HoodieMergeHandle.write(GenericRecord oldRecord)`, which merges the new merged record with old records. ``` public void write(GenericRecord oldRecord) { String key = KeyGenUtils.getRecordKeyFromGenericRecord(oldRecord, keyGeneratorOpt); boolean copyOldRecord = true; if (keyToNewRecords.containsKey(key)) { // If we have duplicate records that we are updating, then the hoodie record will be deflated after // writing the first record. So make a copy of the record to be merged HoodieRecord hoodieRecord = new HoodieRecord<>(keyToNewRecords.get(key)); try { Option combinedAvroRecord = hoodieRecord.getData().combineAndGetUpdateValue(oldRecord, useWriterSchema ? tableSchemaWithMetaFields : tableSchema, config.getPayloadConfig().getProps()); if (combinedAvroRecord.isPresent() && combinedAvroRecord.get().equals(IGNORE_RECORD)) { // If it is an IGNORE_RECORD, just copy the old record, and do not update the new record. 
copyOldRecord = true; } else if (writeUpdateRecord(hoodieRecord, oldRecord, combinedAvroRecord)) { /* * ONLY WHEN 1) we have an update for this key AND 2) We are able to successfully * write the the combined new * value * * We no longer need to copy the old record over. */ copyOldRecord = false; } writtenRecordKeys.add(key); } catch (Exception e) { throw new HoodieUpsertException("Failed to combine/merge new record with old value in storage, for new record {" + keyToNewRecords.get(key) + "}, old value {" + oldRecord + "}", e); } } ```
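The partial-update semantics described in the class comment above (a null field in the newer record falls back to the stored value) can be sketched independent of Avro; `partial_merge` below is a hypothetical pure-Python stand-in for `combineAndGetUpdateValue`:

```python
def partial_merge(old, new):
    """Merge two records: take the new value unless it is None (null),
    in which case keep the old (stored) value."""
    return {field: (new_val if new_val is not None else old.get(field))
            for field, new_val in new.items()}

# The example from the class javadoc: stored (1, 2, 3) merged with
# incoming (4, 5, null) yields (4, 5, 3) -- field c keeps its stored value.
stored = {"a": 1, "b": 2, "c": 3}
incoming = {"a": 4, "b": 5, "c": None}
merged = partial_merge(stored, incoming)
assert merged == {"a": 4, "b": 5, "c": 3}
```

This is exactly why the reviewer argues the merge must also run against records already persisted in base/log files (via `combineAndGetUpdateValue`), not only against records co-resident in the write buffer (via `preCombine`).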
[GitHub] [hudi] hudi-bot commented on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot commented on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1017093744 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319) * a9254d5c5059e77f883467f05dedf08d704b17f1 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4287: [DO NOT MERGE] 0.10.0 release patch for flink
hudi-bot removed a comment on pull request #4287: URL: https://github.com/apache/hudi/pull/4287#issuecomment-1015311668 ## CI report: * 5b7a535559d80359a3febc2d1a80bf9a8ac20cf9 UNKNOWN * 952a154b1c656cd8e3c9c0df9fee313d3890d938 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5319)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot removed a comment on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017090637 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot commented on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017091703 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5357)
[GitHub] [hudi] dongkelun commented on issue #4642: [SUPPORT] Hudi Merge Into
dongkelun commented on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017091320 @LucassLin See the "Create Table" section for Spark SQL in the official documentation: [https://hudi.apache.org/docs/quick-start-guide/](https://hudi.apache.org/docs/quick-start-guide/) ```sql -- create a cow non-partitioned table with preCombineField provided create table hudi_mor_tbl ( id int, name string, price double, ts bigint ) using hudi tblproperties ( type = 'cow', primaryKey = 'id', preCombineField = 'ts' ); ```
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479084#comment-17479084 ] Yue Zhang commented on HUDI-3122: - Since https://github.com/apache/hudi/pull/4551 is merged maybe we can close this issue? > presto query failed for bootstrap tables > > > Key: HUDI-3122 > URL: https://issues.apache.org/jira/browse/HUDI-3122 > Project: Apache Hudi > Issue Type: Improvement > Components: trino-presto >Reporter: Wenning Ding >Priority: Major > > > {{java.lang.NoClassDefFoundError: > org/apache/hudi/org/apache/hadoop/hbase/io/hfile/CacheConfig > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:181) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.access$400(HFileBootstrapIndex.java:76) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.partitionIndexReader(HFileBootstrapIndex.java:272) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.fetchBootstrapIndexInfo(HFileBootstrapIndex.java:262) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.initIndexInfo(HFileBootstrapIndex.java:252) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.(HFileBootstrapIndex.java:243) > at > org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:191) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$addFilesToView$2(AbstractTableFileSystemView.java:137) > at java.util.HashMap.forEach(HashMap.java:1290) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.addFilesToView(AbstractTableFileSystemView.java:134) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$9(AbstractTableFileSystemView.java:294) > at > 
java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660) > at > org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:281)}}
[GitHub] [hudi] hudi-bot commented on pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
hudi-bot commented on pull request #4644: URL: https://github.com/apache/hudi/pull/4644#issuecomment-1017090637 ## CI report: * ce8d15b8d574ce0591dd309dd27bf5e26d0bdef0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017090493 ## CI report: * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5355)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017061971 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5355)
[GitHub] [hudi] dongkelun commented on pull request #3745: [HUDI-2514] Fix delete exception for Spark SQL when sync Hive
dongkelun commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017090320 > > @YannByron thanks for the review. can you take it from here please? from the reproducing steps, looks like a different bug where primary key and other properties were not respected by HMS? > > @xushiyan ok, i'll take this. And it'a real different case, better to create another pr and ticket. Otherwise, it can feel strange and confusing. A new PR has been submitted: [4644](https://github.com/apache/hudi/pull/4644)
[jira] [Updated] (HUDI-3282) Fix delete exception for Spark SQL when sync Hive
[ https://issues.apache.org/jira/browse/HUDI-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3282: - Labels: pull-request-available (was: ) > Fix delete exception for Spark SQL when sync Hive > - > > Key: HUDI-3282 > URL: https://issues.apache.org/jira/browse/HUDI-3282 > Project: Apache Hudi > Issue Type: Bug > Components: hive-sync, spark-sql >Affects Versions: 0.10.0 >Reporter: 董可伦 >Assignee: 董可伦 >Priority: Major > Labels: pull-request-available > Fix For: 0.10.1 > > > h1. Fix delete exception for Spark SQL when sync Hive
[GitHub] [hudi] dongkelun opened a new pull request #4644: [HUDI-3282] Fix delete exception for Spark SQL when sync Hive
dongkelun opened a new pull request #4644: URL: https://github.com/apache/hudi/pull/4644 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request Fix delete exception for Spark SQL when sync Hive ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] LucassLin commented on issue #4642: [SUPPORT] Hudi Merge Into
LucassLin commented on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017089251 > Instead of using a temporary table, try changing the target table to an entity table. The current version should not support the situation that the target table is a temporary table. And the target table must be a Hudi table Thanks for the reply. By entity table, do you mean something like ``` historicalDF.write.saveAsTable("tableName") ``` Can you also elaborate more on "And the target table must be a Hudi table"? How do I ensure the table I write is a Hudi table? Is there a specific API to use to create this entity table as a Hudi table? Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-3282) Fix delete exception for Spark SQL when sync Hive
董可伦 created HUDI-3282: - Summary: Fix delete exception for Spark SQL when sync Hive Key: HUDI-3282 URL: https://issues.apache.org/jira/browse/HUDI-3282 Project: Apache Hudi Issue Type: Bug Components: hive-sync, spark-sql Affects Versions: 0.10.0 Reporter: 董可伦 Assignee: 董可伦 Fix For: 0.10.1 h1. Fix delete exception for Spark SQL when sync Hive -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] nsivabalan edited a comment on issue #3879: [SUPPORT] Incomplete Table Migration
nsivabalan edited a comment on issue #3879: URL: https://github.com/apache/hudi/issues/3879#issuecomment-1017075893 May I know what's the partition path field I should be choosing while writing to hudi? And I assume the record key is UUID and the preCombine field is SORT_KEY.
```
spark.sql("describe tbl1").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|    UUID|   string|   null|
|       A|   string|   null|
|       B|timestamp|   null|
|       C|   string|   null|
|       D|      int|   null|
|       E|timestamp|   null|
|       F|   string|   null|
|       G|timestamp|   null|
|SORT_KEY|timestamp|   null|
+--------+---------+-------+
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot removed a comment on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017060755 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5354) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot commented on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017083372 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5354) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] YannByron commented on pull request #3745: [HUDI-2514] Fix delete exception for Spark SQL when sync Hive
YannByron commented on pull request #3745: URL: https://github.com/apache/hudi/pull/3745#issuecomment-1017082441 > @YannByron thanks for the review. can you take it from here please? from the reproducing steps, looks like a different bug where primary key and other properties were not respected by HMS? @xushiyan ok, i'll take this. And it's really a different case; better to create another PR and ticket for it, otherwise it can feel strange and confusing. @dongkelun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #3879: [SUPPORT] Incomplete Table Migration
nsivabalan commented on issue #3879: URL: https://github.com/apache/hudi/issues/3879#issuecomment-1017075893 May I know what's the partition path field I should be choosing while writing to hudi?
```
spark.sql("describe tbl1").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|    UUID|   string|   null|
|       A|   string|   null|
|       B|timestamp|   null|
|       C|   string|   null|
|       D|      int|   null|
|       E|timestamp|   null|
|       F|   string|   null|
|       G|timestamp|   null|
|SORT_KEY|timestamp|   null|
+--------+---------+-------+
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot commented on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017071243 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5356) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot removed a comment on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017069976 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot removed a comment on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1013607661 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
hudi-bot commented on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017069976 ## CI report: * 80899c440c8c1b0d14b8d80a4f3de9ea87d0b8d4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5264) * a0b32bbf0d5d23b8facbe2581ad086433afc2de6 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] watermelon12138 closed pull request #4637: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target table from multiple source tables.
watermelon12138 closed pull request #4637: URL: https://github.com/apache/hudi/pull/4637 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017061971 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5355) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017060566 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot commented on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017060755 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5354) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot removed a comment on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017059408 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot commented on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1017060566 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) * 6b59b2758827fad29ac77c2e7ac00ba4ee00cbbd UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4441: [HUDI-3085] improve bulk insert partitioner abstraction
hudi-bot removed a comment on pull request #4441: URL: https://github.com/apache/hudi/pull/4441#issuecomment-1008557130 ## CI report: * cdb9542f861b32af8fdedb3f5107b3a6d60b3d2d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5040) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5044) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
hudi-bot commented on pull request #4643: URL: https://github.com/apache/hudi/pull/4643#issuecomment-1017059408 ## CI report: * d63b3431b7e4331bf5bcc5e8789d008a296848f4 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] watermelon12138 commented on pull request #4637: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target table from multiple source tables.
watermelon12138 commented on pull request #4637: URL: https://github.com/apache/hudi/pull/4637#issuecomment-1017058557 @nsivabalan Thank you for the advice. However, the resumeCheckpointStr calculation in DeltaSync only applies when a single source updates a single target table; it does not work when multiple sources update one target table. When multiple sources update a single target, each source needs an independent checkpoint so that it can recover from its own checkpoint. I only changed the methods that calculate resumeCheckpointStr and save checkpoints into checkpointCommitMetadata in DeltaSync (specifically, I added new methods for this). This does not affect the calculation logic when a single source updates a single target. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
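The per-source checkpoint idea described in the comment can be sketched as follows. This is a simplified illustration, not Hudi's actual code: the function names and the composite metadata keys are hypothetical, and the real logic lives in DeltaSync's handling of resumeCheckpointStr and checkpointCommitMetadata.

```python
def save_checkpoints(commit_metadata: dict, source_checkpoints: dict) -> dict:
    """Store one checkpoint per source under a composite key, so several
    sources feeding one target table do not overwrite each other."""
    for source_id, ckpt in source_checkpoints.items():
        commit_metadata[f"deltastreamer.checkpoint.{source_id}"] = ckpt
    return commit_metadata


def resume_checkpoint(commit_metadata: dict, source_id: str):
    """Resolve the resume point for one source; fall back to a legacy
    single-source key so existing single-source pipelines keep working."""
    return commit_metadata.get(
        f"deltastreamer.checkpoint.{source_id}",
        commit_metadata.get("deltastreamer.checkpoint.key"),
    )
```

The fallback branch is what keeps the single-source-to-single-target path unchanged: a pipeline that never wrote per-source keys resolves to the same checkpoint it always did.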
[jira] [Updated] (HUDI-3281) Tuning performance of getAllPartitionPaths in FileSystemBackedTableMetadata
[ https://issues.apache.org/jira/browse/HUDI-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3281: - Labels: pull-request-available (was: ) > Tuning performance of getAllPartitionPaths in FileSystemBackedTableMetadata > --- > > Key: HUDI-3281 > URL: https://issues.apache.org/jira/browse/HUDI-3281 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Yue Zhang >Assignee: Yue Zhang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] zhangyue19921010 opened a new pull request #4643: [HUDI-3281][Performance]Tuning performance of getAllPartitionPaths API in FileSystemBackedTableMetadata
zhangyue19921010 opened a new pull request #4643: URL: https://github.com/apache/hudi/pull/4643 https://issues.apache.org/jira/projects/HUDI/issues/HUDI-3281 ## What is the purpose of the pull request The current implementation of the getAllPartitionPaths API lists/collects all the data files and checks for `.hoodie_partition_metadata` among them. As we know, listing is a pretty heavy operation, especially in scenarios where S3 is used as storage. We sometimes see 20+ seconds for a single list call, causing streaming job delays. ## Brief change log Just check whether `partitions/.hoodie_partition_metadata` exists instead of listing all the data files under each partition. Here are the test results based on S3 ![image](https://user-images.githubusercontent.com/69956021/150256681-379d2138-2c4a-4f11-a703-1399b268290c.png)

|           | time1 | time2 | time3 | time4 | time5 | time6 | time7 | time8 | time9 | time10 | avg (ms) |
|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|----------|
| Original  | 6045  | 5627  | 5736  | 5697  | 5733  | 5321  | 5462  | 5700  | 6072  | 5367   | 5676     |
| Optimized | 2888  | 2730  | 2717  | 2680  | 2684  | 2778  | 2650  | 2728  | 3107  | 2908   | 2787     |

## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
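The optimization in the PR above — probing for the partition marker file instead of listing every data file — can be sketched in Python. This is a hedged illustration with hypothetical helper names, not Hudi's code (the real logic lives in FileSystemBackedTableMetadata and operates over Hadoop FileSystem calls):

```python
import os

PARTITION_METADATA_FILE = ".hoodie_partition_metadata"


def list_partitions_by_listing(base_path):
    """Original approach: list every file under every directory, then
    check whether the partition marker appears among them (cost grows
    with the number of files)."""
    partitions = []
    for root, _dirs, files in os.walk(base_path):
        if PARTITION_METADATA_FILE in files:
            partitions.append(os.path.relpath(root, base_path))
    return sorted(partitions)


def list_partitions_by_probe(base_path, candidate_dirs):
    """Optimized approach: issue one existence check per candidate
    directory instead of listing its contents (cost grows only with the
    number of directories)."""
    return sorted(
        d for d in candidate_dirs
        if os.path.isfile(os.path.join(base_path, d, PARTITION_METADATA_FILE))
    )
```

On object stores like S3, an existence check is a single HEAD-style request per directory, while a listing pages through every object, which is why the probe variant wins in the benchmark table above.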
[GitHub] [hudi] alexeykudinkin commented on pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
alexeykudinkin commented on pull request #4606: URL: https://github.com/apache/hudi/pull/4606#issuecomment-1017049008 @nsivabalan correct, all configs are kept and marked as deprecated. The only thing that changes is that some of them no longer have any effect. How should we handle this? For example, `LAYOUT_OPTIMIZATION_ENABLE` is not used anymore, but that should not affect users: 1. Those that didn't use Clustering based on Spatial Curves stay the same (other configs are required for that). 2. Those that did use Clustering based on Spatial Curves will also not be affected, because it also required clustering to be enabled (which they should already have enabled). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] watermelon12138 removed a comment on pull request #4637: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target table from multiple source tables.
watermelon12138 removed a comment on pull request #4637: URL: https://github.com/apache/hudi/pull/4637#issuecomment-1017048479 @nsivabalan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] watermelon12138 commented on pull request #4637: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single target table from multiple source tables.
watermelon12138 commented on pull request #4637: URL: https://github.com/apache/hudi/pull/4637#issuecomment-1017048479 @nsivabalan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4606: [HUDI-2872][HUDI-2646] Refactoring layout optimization (clustering) flow to support linear ordering
alexeykudinkin commented on a change in pull request #4606: URL: https://github.com/apache/hudi/pull/4606#discussion_r788291798 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java ## @@ -207,41 +199,71 @@ .withDocumentation("Enable use z-ordering/space-filling curves to optimize the layout of table to boost query performance. " + "This parameter takes precedence over clustering strategy set using " + EXECUTION_STRATEGY_CLASS_NAME.key()); - public static final ConfigProperty LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty + /** + * Determines ordering strategy in for records layout optimization. + * Currently, following strategies are supported + * + * Linear: simply orders records lexicographically + * Z-order: orders records along Z-order spatial-curve + * Hilbert: orders records along Hilbert's spatial-curve + * + * + * NOTE: "z-order", "hilbert" strategies may consume considerably more compute, than "linear". + * Make sure to perform small-scale local testing for your dataset before applying globally. + */ + public static final ConfigProperty LAYOUT_OPTIMIZE_STRATEGY = ConfigProperty .key(LAYOUT_OPTIMIZE_PARAM_PREFIX + "strategy") .defaultValue("z-order") .sinceVersion("0.10.0") - .withDocumentation("Type of layout optimization to be applied, current only supports `z-order` and `hilbert` curves."); + .withDocumentation("Determines ordering strategy used in records layout optimization. " + + "Currently supported strategies are \"linear\", \"z-order\" and \"hilbert\" values are supported."); /** - * There exists two method to build z-curve. - * one is directly mapping sort cols to z-value to build z-curve; - * we can find this method in Amazon DynamoDB https://aws.amazon.com/cn/blogs/database/tag/z-order/ - * the other one is Boundary-based Interleaved Index method which we proposed. simply call it sample method. - * Refer to rfc-28 for specific algorithm flow. 
- * Boundary-based Interleaved Index method has better generalization, but the build speed is slower than direct method. + * NOTE: This setting only has effect if {@link #LAYOUT_OPTIMIZE_STRATEGY} value is set to + * either "z-order" or "hilbert" (ie leveraging space-filling curves) + * + * Currently, two methods to order records along the curve are supported "build" and "sample": Review comment: Good catch! ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java ## @@ -134,16 +134,28 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext * @return {@link RDDCustomColumnsSortPartitioner} if sort columns are provided, otherwise empty. */ protected Option> getPartitioner(Map strategyParams, Schema schema) { -if (getWriteConfig().isLayoutOptimizationEnabled()) { - // sort input records by z-order/hilbert - return Option.of(new RDDSpatialCurveOptimizationSortPartitioner((HoodieSparkEngineContext) getEngineContext(), - getWriteConfig(), HoodieAvroUtils.addMetadataFields(schema))); -} else if (strategyParams.containsKey(PLAN_STRATEGY_SORT_COLUMNS.key())) { - return Option.of(new RDDCustomColumnsSortPartitioner(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key()).split(","), - HoodieAvroUtils.addMetadataFields(schema), getWriteConfig().isConsistentLogicalTimestampEnabled())); -} else { - return Option.empty(); -} +Option orderByColumnsOpt = +Option.ofNullable(strategyParams.get(PLAN_STRATEGY_SORT_COLUMNS.key())) +.map(listStr -> listStr.split(",")); + +return orderByColumnsOpt.map(orderByColumns -> { Review comment: It will fallback to no-op in that case -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
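The "direct mapping" Z-curve method mentioned in the review comment above — interleaving the bits of the sort columns to form a Z-value — can be illustrated in Python. This is a simplified two-column sketch under the assumption of non-negative integer keys; it is not Hudi's implementation, which also supports the boundary-based interleaved-index (sample) method from RFC-28:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two non-negative ints: bit i of x lands at
    position 2*i and bit i of y at position 2*i + 1, producing a key
    along the Z-order space-filling curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sorting records by their Z-value keeps points that are close in the
# (x, y) plane close together in the sorted order, which is the locality
# that layout optimization exploits for file skipping.
records = [(3, 5), (0, 0), (7, 1), (2, 2)]
ordered = sorted(records, key=lambda r: z_value(*r))
```

A "linear" strategy, by contrast, would simply sort lexicographically on (x, y); the spatial-curve strategies cost more compute per record but preserve locality across all sort columns at once.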
[GitHub] [hudi] dongkelun edited a comment on issue #4642: [SUPPORT] Hudi Merge Into
dongkelun edited a comment on issue #4642: URL: https://github.com/apache/hudi/issues/4642#issuecomment-1017039375 Instead of using a temporary table, try changing the target table to an entity (non-temporary) table. The current version likely does not support a temporary table as the merge target. And the target table must be a Hudi table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org