[GitHub] [hudi] felixYyu commented on a diff in pull request #5064: [HUDI-3654] Add new module `hudi-metaserver`
felixYyu commented on code in PR #5064: URL: https://github.com/apache/hudi/pull/5064#discussion_r984168332

## hudi-metaserver/src/main/resources/mybatis/DDLMapper.xml:

@@ -0,0 +1,127 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
+
+CREATE TABLE dbs
+(
+db_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+desc VARCHAR(512) COMMENT 'database description',
+location_uri VARCHAR(512) COMMENT 'database storage path',
+name VARCHAR(512) UNIQUE COMMENT 'database name',
+owner_name VARCHAR(512) COMMENT 'database owner',
+owner_type VARCHAR(512) COMMENT 'database type',
+create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'db created time',
+update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time'
+) COMMENT 'databases';
+
+CREATE TABLE tbls
+(
+tbl_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+db_id BIGINT COMMENT 'database id',
+name VARCHAR(512) COMMENT 'table name',
+create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'table created time',
+update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
+owner_name VARCHAR(512) COMMENT 'table owner',
+location VARCHAR(512) COMMENT 'table location',
+UNIQUE KEY uniq_tb (db_id, name)
+) COMMENT 'tables';
+
+CREATE TABLE tbl_params
+(
+tbl_id BIGINT UNSIGNED COMMENT 'tbl id',
+param_key VARCHAR(256) COMMENT 'param_key',
+param_value VARCHAR(2048) COMMENT 'param_value',
+create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'parameter created time',
+update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
+PRIMARY KEY (tbl_id, param_key)
+) COMMENT 'tbl params';
+
+CREATE TABLE partitions
+(
+part_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+tbl_id BIGINT COMMENT 'table id',
+part_name VARCHAR(256) COMMENT 'partition path',
+create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'create time',
+update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
+is_deleted BOOL DEFAULT FALSE COMMENT 'whether the partition is deleted',
+UNIQUE uniq_partition_version (tbl_id, part_name)
+) COMMENT 'partitions';
+
+CREATE TABLE tbl_timestamp
+(
+tbl_id BIGINT UNSIGNED PRIMARY KEY COMMENT 'uuid',
+ts VARCHAR(17) COMMENT 'instant timestamp'
+) COMMENT 'generate the unique timestamp for a table';
+
+CREATE TABLE instant
+(
+instant_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+tbl_id BIGINT COMMENT 'table id',
+ts VARCHAR(17) COMMENT 'instant timestamp',
+action TINYINT COMMENT 'commit, deltacommit, compaction, replace etc',
+state TINYINT COMMENT 'completed, requested, inflight, invalid etc',
+duration INT DEFAULT 0 COMMENT 'for heartbeat (s)',
+start_ts INT DEFAULT 0 COMMENT 'for heartbeat (s)',
+UNIQUE KEY uniq_inst1 (tbl_id, state, ts, action),
+UNIQUE KEY uniq_inst2 (tbl_id, ts)
+) COMMENT 'timeline';
+
+CREATE TABLE instant_meta
+(
+commit_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+tbl_id BIGINT COMMENT 'table id',
+ts VARCHAR(17) COMMENT 'instant timestamp',
+action TINYINT COMMENT 'commit, deltacommit, compaction, replace etc',
+state TINYINT COMMENT 'completed, requested, inflight, invalid etc',
+data LONGBLOB COMMENT 'instant metadate',

Review Comment: typo 'metadate'->'metadata'

## hudi-metaserver/src/main/java/org/apache/hudi/common/table/HoodieTableMetaServerClient.java:

@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY
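The `ts VARCHAR(17)` columns in the DDL above hold Hudi instant timestamps. As a hedged sketch (assuming the `yyyyMMddHHmmssSSS` commit-time format Hudi writes to its timeline, which is exactly 17 characters; the helper name here is illustrative, not part of the PR):

```python
from datetime import datetime, timezone

def make_instant_ts(now=None):
    """Format a Hudi-style instant timestamp: yyyyMMddHHmmssSSS (17 chars)."""
    now = now or datetime.now(timezone.utc)
    # %f gives microseconds; keep only the first 3 digits for milliseconds
    return now.strftime("%Y%m%d%H%M%S") + now.strftime("%f")[:3]

# 14 date/time digits + 3 millisecond digits fit the ts VARCHAR(17) column
ts = make_instant_ts(datetime(2022, 9, 29, 12, 30, 45, 123000))
```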
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in lowercase. [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in lowercase. 
[https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > Issue : > Classname to use for non partitioned tables should be > {color:#0747a6}*NonpartitionedKeyGenerator* ( currently > *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in > lowercase. > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in lowercase. [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. 
*P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator) [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}*NonpartitionedKeyGenerator* ( currently > *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in > lowercase. > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator) [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *{color:#de350b}{color:#0747a6}NonPartitionedKeyGenerator{color}){color}* as per this repo. 
*P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}*NonpartitionedKeyGenerator* ( currently > *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in > lowercase (Non{*}p{*}artitionedKeyGenerator) > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Priority: Major (was: Minor) > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo. > *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color}{color} > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo. *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color}{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for *[{color:#172b4d}Nonpartitionedkeygenerator{color}|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator]* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator as per this repo.{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] P should be in lowercase (Non{*}p{*}artitionedKeyGenerator) > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Minor > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > 
[https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo. > *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color}{color} > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *{color:#de350b}{color:#0747a6}NonPartitionedKeyGenerator{color}){color}* as per this repo. *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo. 
*P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color}{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}*NonpartitionedKeyGenerator* ( currently > *{color:#de350b}{color:#0747a6}NonPartitionedKeyGenerator{color}){color}* as > per this repo. *P* should be in lowercase > (Non{*}p{*}artitionedKeyGenerator){color} > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on issue #6800: [SUPPORT]org.apache.avro.SchemaParseException: Illegal initial character: 1Min
nsivabalan commented on issue #6800: URL: https://github.com/apache/hudi/issues/6800#issuecomment-1263123231 if you are using deltastreamer, you can add a schema post processor and rename columns. if not, can't think of any easy solution apart from manually fixing it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for *[{color:#172b4d}Nonpartitionedkeygenerator{color}|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator]* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator as per this repo.{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] P should be in lowercase (Non{*}p{*}artitionedKeyGenerator) was: Typo in Hudi documentation for [Nonpartitionedkeygenerator|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] URL - [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo.{color}{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] P should be in lowercase ({color:#0747a6}Non{*}p{*}artitionedKeyGenerator){color} > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Minor > > Typo in Hudi documentation for > *[{color:#172b4d}Nonpartitionedkeygenerator{color}|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator]* > > URL - > 
[https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}NonpartitionedKeyGenerator as per this repo.{color} > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > P should be in lowercase (Non{*}p{*}artitionedKeyGenerator) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on issue #6800: [SUPPORT]org.apache.avro.SchemaParseException: Illegal initial character: 1Min
nsivabalan commented on issue #6800: URL: https://github.com/apache/hudi/issues/6800#issuecomment-1263122615 we rely on avro's field naming conventions. looks like the starting character cannot be a number. https://issues.apache.org/jira/browse/AVRO-153
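For context, the Avro specification restricts record and field names to `[A-Za-z_][A-Za-z0-9_]*`, which is why a column named `1Min` fails to parse. A small hand-rolled sketch of that rule (not avro's own API; the `suggest_rename` strategy is just one possible fix):

```python
import re

# Avro name rule: first char is a letter or underscore, rest alphanumeric/underscore
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_avro_name(name: str) -> bool:
    return bool(AVRO_NAME.match(name))

def suggest_rename(name: str) -> str:
    """One possible fix: prefix a name that starts with a digit with an underscore."""
    return name if is_valid_avro_name(name) else "_" + name
```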
[GitHub] [hudi] nsivabalan commented on issue #6804: [SUPPORT] Repairing the hudi table from No such file or directory of parquet file.
nsivabalan commented on issue #6804: URL: https://github.com/apache/hudi/issues/6804#issuecomment-1263121682 if not for the metadata table, can't think of an easier way to go about this. essentially the cleaner has cleaned up some data file that is still required by the query. if you have very aggressive cleaner configs, you may try to relax them based on the max time any query can take for the table of interest.
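The advice above can be made concrete: the cleaner's retention window (roughly, commits retained times the average commit interval) should exceed the longest query you expect to run against the table. A hedged back-of-the-envelope check (the helpers and numbers are illustrative, not a Hudi API):

```python
def retention_window_secs(commits_retained: int, avg_commit_interval_secs: float) -> float:
    """Approximate how far back in time the retained commits reach."""
    return commits_retained * avg_commit_interval_secs

def cleaner_is_safe(commits_retained: int, avg_commit_interval_secs: float,
                    max_query_secs: float) -> bool:
    # Data files older than the retention window may be cleaned mid-query,
    # producing "No such file or directory" on the query side.
    return retention_window_secs(commits_retained, avg_commit_interval_secs) > max_query_secs

# e.g. 10 retained commits arriving every 5 minutes cover ~50 minutes of query time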
[GitHub] [hudi] nsivabalan commented on issue #6825: [SUPPORT]org.apache.hudi.exception.HoodieRemoteException: *****:37568 failed to respond
nsivabalan commented on issue #6825: URL: https://github.com/apache/hudi/issues/6825#issuecomment-1263120280 guess the timeline server crashed for some reason. CC @yihua, any thoughts?
[jira] [Created] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
Jayasheel Kalgal created HUDI-4953: -- Summary: Typo in Hudi documentation about NonPartitionedKeyGenerator Key: HUDI-4953 URL: https://issues.apache.org/jira/browse/HUDI-4953 Project: Apache Hudi Issue Type: Bug Components: docs Reporter: Jayasheel Kalgal Typo in Hudi documentation for [Nonpartitionedkeygenerator|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] URL - [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo.{color}{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] P should be in lowercase ({color:#0747a6}Non{*}p{*}artitionedKeyGenerator){color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
nsivabalan commented on issue #6835: URL: https://github.com/apache/hudi/issues/6835#issuecomment-1263117459 since we have a patch being actively worked on, closing the issue. thanks for reporting.
[GitHub] [hudi] nsivabalan closed issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
nsivabalan closed issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm URL: https://github.com/apache/hudi/issues/6835
[GitHub] [hudi] nsivabalan commented on issue #5582: [SUPPORT] NullPointerException in merge into Spark Sql HoodieSparkSqlWriter$.mergeParamsAndGetHoodieConfig
nsivabalan commented on issue #5582: URL: https://github.com/apache/hudi/issues/5582#issuecomment-1263116299 @nitinkul @vicuna96 : gentle ping.
[GitHub] [hudi] nsivabalan commented on issue #6503: [SUPPORT] Hudi Merge Into with larger volume
nsivabalan commented on issue #6503: URL: https://github.com/apache/hudi/issues/6503#issuecomment-1263115964 my understanding is that preCombine is a mandatory field for the merge into statement. But I will let @alexeykudinkin investigate further.
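For context on why preCombine matters: when multiple incoming records share a record key, Hudi uses the preCombine field to decide which one wins. A hedged, pure-Python sketch of that semantics (not Hudi's actual implementation; the field names are illustrative):

```python
def precombine(records, key_field="id", precombine_field="ts"):
    """Keep, per key, the record with the greatest preCombine value."""
    latest = {}
    for rec in records:
        k = rec[key_field]
        if k not in latest or rec[precombine_field] > latest[k][precombine_field]:
            latest[k] = rec
    return list(latest.values())
```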
[GitHub] [hudi] nsivabalan commented on issue #5777: [SUPPORT] Hudi table has duplicate data.
nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1263114829 I see you have given test data. is everything to be ingested in one single commit, or using different commits? your reproducible script is not very clear on this.
[GitHub] [hudi] nsivabalan commented on issue #5777: [SUPPORT] Hudi table has duplicate data.
nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1263114477 @jiangjiguang : did not realize you had given us a reproducible code snippet. so from what you have given above, you could see duplicate data w/ the MOR RT query?
[GitHub] [hudi] nsivabalan commented on issue #5777: [SUPPORT] Hudi table has duplicate data.
nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1263111919 sorry to have dropped the ball on this. again picking it up. btw, I see the config `hoodie.datasource.write.insert.drop.duplicates` was proposed earlier. do not set this to true. if set to true, records from the incoming batch that are already in storage will be dropped.
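To illustrate the effect of `hoodie.datasource.write.insert.drop.duplicates=true` described above, here is a hedged sketch (pure Python, not Hudi code; the helper and field names are illustrative): incoming records whose keys already exist in storage never reach the table.

```python
def apply_insert_drop_duplicates(incoming, existing_keys, key_field="id"):
    """Mimic insert.drop.duplicates=true: drop records whose key is already stored."""
    return [rec for rec in incoming if rec[key_field] not in existing_keys]

# With the flag on, a re-sent record is silently discarded,
# which looks like data loss if you expected an upsert of the new values.
```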
[GitHub] [hudi] jiangbiao910 commented on issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
jiangbiao910 commented on issue #6462: URL: https://github.com/apache/hudi/issues/6462#issuecomment-1263109912 @nsivabalan Thank you for your reply. if I don't set "hoodie.metadata.enable"="false", it throws "java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()". if I set "hoodie.metadata.enable"="false", it throws "Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata" often, but not every time. But when I run the sql again, it works well. I think HBase relies on Hadoop 2.10.0, but our environment is CDH-6.3.2 and the hadoop version is 3.0.
[jira] [Closed] (HUDI-4934) Cleaner cleans up files touched by clustering
[ https://issues.apache.org/jira/browse/HUDI-4934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-4934. - Resolution: Fixed > Cleaner cleans up files touched by clustering > - > > Key: HUDI-4934 > URL: https://issues.apache.org/jira/browse/HUDI-4934 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > I have some integration long running tests w/ cleaner and clustering. from > 21st or 22nd of sep, my tests have started to fail. > > Reason is, when clustering kicks in, it could not find the data files to be > clustered. Looks like cleaner has cleaned it up. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1263105795 ## CI report: * 77223f8b87bdfcfa75045fb622b127cc4f9e47ab Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11919) * 34427d0e522bec7eee731644080bd0b5d20570dc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11921) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6818: [HUDI-4948] Improve CDC Write
hudi-bot commented on PR #6818: URL: https://github.com/apache/hudi/pull/6818#issuecomment-1263105772 ## CI report: * f14363a4be66f8a05ddbbe14600176da151d04ff Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11843) * e0ccacd8d030984ed30f19b17b0dafb02d8685ee Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11920)
[GitHub] [hudi] hudi-bot commented on pull request #6793: [HUDI-4917] Optimized the way to get HoodieBaseFile of loadColumnRange…
hudi-bot commented on PR #6793: URL: https://github.com/apache/hudi/pull/6793#issuecomment-1263105718 ## CI report: * 32cc352122d276f5bb5943a0dd420920854fdb8e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11837) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11916)
[jira] [Updated] (HUDI-4948) Support flush and rollover for CDC Write
[ https://issues.apache.org/jira/browse/HUDI-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4948: - Labels: pull-request-available (was: ) > Support flush and rollover for CDC Write > > > Key: HUDI-4948 > URL: https://issues.apache.org/jira/browse/HUDI-4948 > Project: Apache Hudi > Issue Type: Improvement > Components: core, spark, writer-core >Reporter: Yann Byron >Priority: Major > Labels: pull-request-available >
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1263103453 ## CI report: * 77223f8b87bdfcfa75045fb622b127cc4f9e47ab Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11919) * 34427d0e522bec7eee731644080bd0b5d20570dc UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6818: [HUDI-4948] Improve CDC Write
hudi-bot commented on PR #6818: URL: https://github.com/apache/hudi/pull/6818#issuecomment-1263103404 ## CI report: * f14363a4be66f8a05ddbbe14600176da151d04ff Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11843) * e0ccacd8d030984ed30f19b17b0dafb02d8685ee UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6741: [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table
hudi-bot commented on PR #6741: URL: https://github.com/apache/hudi/pull/6741#issuecomment-1263100954 ## CI report: * bff3acafde6d8a1bd5574b90ce644ef30acbf0a2 UNKNOWN * e39d50d6242e272f867c9987a8a2e97ca323568f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11886) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11915)
[GitHub] [hudi] nsivabalan commented on issue #6101: [SUPPORT] Hudi Delete Not working with EMR, AWS Glue & S3
nsivabalan commented on issue #6101: URL: https://github.com/apache/hudi/issues/6101#issuecomment-1263071072 @navbalaraman : hey any updates for us. if you could not reproduce, feel free to close it out.
[GitHub] [hudi] nsivabalan commented on issue #6504: [SUPPORT] Hudi deletes fail in HoodieDeltaStreamer
nsivabalan commented on issue #6504: URL: https://github.com/apache/hudi/issues/6504#issuecomment-1263070852 @santoshraj123 : gentle ping. if you got the issue resolved, feel free to close it out.
[GitHub] [hudi] nsivabalan commented on issue #6428: [SUPPORT] S3 Deltastreamer: Block has already been inflated
nsivabalan commented on issue #6428: URL: https://github.com/apache/hudi/issues/6428#issuecomment-1263070600 Since we could not reproduce w/ OSS spark, can you reach out to aws support. CC @umehrot2 @rahil-c : Have you folks seen this issue before. seems like simple read from metadata table is failing w/ EMR spark.
[GitHub] [hudi] nsivabalan commented on issue #6428: [SUPPORT] S3 Deltastreamer: Block has already been inflated
nsivabalan commented on issue #6428: URL: https://github.com/apache/hudi/issues/6428#issuecomment-1263069837 yes, you are right. you can disable via hudi-cli as well.
[GitHub] [hudi] nsivabalan commented on issue #6421: [SUPPORT]Table property not working while creating table - hoodie.datasource.write.drop.partition.columns
nsivabalan commented on issue #6421: URL: https://github.com/apache/hudi/issues/6421#issuecomment-1263069591 @sandip-yadav : gentle ping. did you get a chance to try 0.12.
[GitHub] [hudi] wwli05 commented on issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
wwli05 commented on issue #6835: URL: https://github.com/apache/hudi/issues/6835#issuecomment-1263069076 thank you, friends, really JI_SHI_YU ("timely rain", i.e. timely help)
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1263068964 ## CI report: * 77223f8b87bdfcfa75045fb622b127cc4f9e47ab Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11919)
[GitHub] [hudi] nsivabalan closed issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
nsivabalan closed issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata URL: https://github.com/apache/hudi/issues/6462
[GitHub] [hudi] nsivabalan commented on issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
nsivabalan commented on issue #6462: URL: https://github.com/apache/hudi/issues/6462#issuecomment-1263068927 closing github issue as we have a fix. thanks for reporting.
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1263066749 ## CI report: * 77223f8b87bdfcfa75045fb622b127cc4f9e47ab UNKNOWN
[GitHub] [hudi] nsivabalan commented on issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
nsivabalan commented on issue #6462: URL: https://github.com/apache/hudi/issues/6462#issuecomment-1263064400 https://github.com/apache/hudi/pull/6836
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1263063979 ## CI report: * 288d166c49602a4593b1e97763a467811903737d UNKNOWN * ae59f6f918a5a08535b73be5c3fc2f29f5e84fb9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11879) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11913)
[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4952: - Labels: pull-request-available (was: ) > Reading from metadata table could fail when there are no completed commits > -- > > Key: HUDI-4952 > URL: https://issues.apache.org/jira/browse/HUDI-4952 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > When metadata table is just getting initialized, but first commit is not yet > fully complete, reading from metadata table could fail w/ below stacktrace. > > {code:java} > 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job > 39d720db-b15d-4823-b8b1-54398b143d6e > org.apache.hudi.exception.HoodieException: Error fetching partition paths > from metadata table > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) > at > org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176) > at > org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219) > at > org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264) > at > org.apache.hudi.BaseHoodieTableFileIndex.<init>(BaseHoodieTableFileIndex.java:139) > at > org.apache.hudi.hadoop.HiveHoodieTableFileIndex.<init>(HiveHoodieTableFileIndex.java:49) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141) > at > org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889) 
> at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217) > at > org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getSplits(HoodieCombineHiveInputFormat.java:942) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getCombineSplits(HoodieCombineHiveInputFormat.java:241) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getSplits(HoodieCombineHiveInputFormat.java:363) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) > at org.apache.spark.rdd.RDD.getNumPartitions(RDD.scala:267) > at > org.apache.spark.api.java.JavaRDDLike$class.getNumPartitions(JavaRDDLike.scala:65) > at > org.apache.spark.api.java.AbstractJavaRDDLike.getNumPartitions(JavaRDDLike.scala:45) > at > org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateMapInput(SparkPlanGenerator.java:252) > at > org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:179) > at > org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:130) > at > org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:355) > at > org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:400) > at > org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:365) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to > retrieve list of partition from metadata > at > org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:113) > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:313) > ... 32 more > Caused by: java.util.NoSuchElementException: No value present in Option > at org.apache.hudi.common.util.Option.get(Option.java:89) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.getPartitionFileSlices(HoodieTableMetadataUtil.java:1057) > at >
[GitHub] [hudi] nsivabalan opened a new pull request, #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
nsivabalan opened a new pull request, #6836: URL: https://github.com/apache/hudi/pull/6836 ### Change Logs When metadata table is just getting initialized, but first commit is not yet fully complete, reading from metadata table could fail w/ below stacktrace. Call trace that could result in this.
```
BaseHoodieTableFileIndex.doRefresh() // metadataConfig will have metadata enabled if user enables for the query session. lets assume user enabled while the metadata table is being built out.
{
    HoodieTableMetadata newTableMetadata = HoodieTableMetadata.create(engineContext, metadataConfig, );
    // HoodieTableMetadata.create eventually will call constructor of HoodieBackedTableMetadata(), within which we call initIfNeeded().
    // within initIfNeeded { we disable metadata only if table itself is not found. if not, metadata is still enabled. }
    -> loadPartitionPathFiles
}
loadPartitionPathFiles {
    ...
    getAllFilesInPartitionsUnchecked()
}
getAllFilesInPartitionsUnchecked {
    tableMetadata.getAllFilesInPartitions(list of interested partitions)
}
getAllFilesInPartitions {
    BaseTableMetadata.fetchAllFilesInPartitionPaths...
}
BaseTableMetadata.fetchAllFilesInPartitionPaths {
    ..
    getRecordsByKeys(...)
}
HoodieBackedTableMetadata.getRecordsByKeys {
    getPartitionFileSliceToKeysMapping()
}
getPartitionFileSliceToKeysMapping {
    List<FileSlice> latestFileSlices = HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, partitionName);
}
HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices {
    HoodieTableFileSystemView fsView = fileSystemView.orElse(getFileSystemView(metaClient));
    Stream<FileSlice> fileSliceStream;
    if (mergeFileSlices) { // this is true for this call graph.
        if (metaClient.getActiveTimeline().filterCompletedInstants().lastInstant().isPresent()) {
            fileSliceStream = fsView.getLatestMergedFileSlicesBeforeOrOn(
                partition, metaClient.getActiveTimeline().filterCompletedInstants().lastInstant().get().getTimestamp());
        }
    }
}
```
There is no lastInstant as the Metadata table is still being initialized.

### Impact
_Describe any public API or user-facing feature change or any performance impact._
**Risk level: none | low | medium | high**
_Choose one. If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update
_Describe any necessary documentation update if there is any new feature, config, or user-facing change_
- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
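The crux of the bug is an unguarded `lastInstant().get()`: while the metadata table is still initializing there is no completed instant, and Hudi's `Option.get()` throws `java.util.NoSuchElementException: No value present`. A minimal standalone sketch of the failure mode and the guard, using `java.util.Optional` as a stand-in for `org.apache.hudi.common.util.Option` (the class and helper names below are illustrative, not Hudi APIs):

```java
import java.util.Optional;

public class LastInstantGuard {

    // Stand-in for metaClient.getActiveTimeline().filterCompletedInstants().lastInstant():
    // empty while the metadata table has no completed commit yet.
    static Optional<String> lastCompletedInstant(boolean hasCompletedCommit) {
        return hasCompletedCommit ? Optional.of("20220820025658") : Optional.empty();
    }

    // Buggy pattern: get() on an empty Optional throws NoSuchElementException,
    // matching the "No value present" error in the stack trace above.
    static String unguarded(boolean hasCompletedCommit) {
        return lastCompletedInstant(hasCompletedCommit).get();
    }

    // Guarded pattern: check isPresent() first and fall back gracefully.
    static String guarded(boolean hasCompletedCommit) {
        Optional<String> last = lastCompletedInstant(hasCompletedCommit);
        return last.isPresent() ? last.get() : "NO_COMPLETED_INSTANT";
    }

    public static void main(String[] args) {
        System.out.println(guarded(true));
        System.out.println(guarded(false));
        try {
            unguarded(false);
        } catch (java.util.NoSuchElementException e) {
            // Reached only when no commit has completed yet.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The same shape of guard is what the call trace implies the fix needs: only query file slices `BeforeOrOn` a timestamp when a completed instant actually exists.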
[GitHub] [hudi] xiarixiaoyao commented on issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
xiarixiaoyao commented on issue #6835: URL: https://github.com/apache/hudi/issues/6835#issuecomment-1263059421 https://github.com/apache/hudi/pull/6741
[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4952: -- Sprint: 2022/09/19
[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4952: -- Priority: Blocker (was: Major)
[jira] [Assigned] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-4952: - Assignee: sivabalan narayanan > Reading from metadata table could fail when there are no completed commits > -- > > Key: HUDI-4952 > URL: https://issues.apache.org/jira/browse/HUDI-4952 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > > When metadata table is just getting initialized, but first commit is not yet > fully complete, reading from metadata table could fail w/ below stacktrace. > > {code:java} > 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job > 39d720db-b15d-4823-b8b1-54398b143d6e > org.apache.hudi.exception.HoodieException: Error fetching partition paths > from metadata table > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) > at > org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176) > at > org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219) > at > org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264) > at > org.apache.hudi.BaseHoodieTableFileIndex.(BaseHoodieTableFileIndex.java:139) > at > org.apache.hudi.hadoop.HiveHoodieTableFileIndex.(HiveHoodieTableFileIndex.java:49) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141) > at > org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889) > at > 
> ... {code}
[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4952: -- Fix Version/s: 0.12.1 > Reading from metadata table could fail when there are no completed commits > -- > > Key: HUDI-4952 > URL: https://issues.apache.org/jira/browse/HUDI-4952 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.12.1 > > > When metadata table is just getting initialized, but first commit is not yet > fully complete, reading from metadata table could fail w/ below stacktrace. > > {code:java} > 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job > 39d720db-b15d-4823-b8b1-54398b143d6e > org.apache.hudi.exception.HoodieException: Error fetching partition paths > from metadata table > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) > at > org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176) > at > org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219) > at > org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264) > at > org.apache.hudi.BaseHoodieTableFileIndex.(BaseHoodieTableFileIndex.java:139) > at > org.apache.hudi.hadoop.HiveHoodieTableFileIndex.(HiveHoodieTableFileIndex.java:49) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141) > at > org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889) > at > 
> ... {code}
[jira] [Created] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
sivabalan narayanan created HUDI-4952: - Summary: Reading from metadata table could fail when there are no completed commits Key: HUDI-4952 URL: https://issues.apache.org/jira/browse/HUDI-4952 Project: Apache Hudi Issue Type: Bug Components: metadata Reporter: sivabalan narayanan When metadata table is just getting initialized, but first commit is not yet fully complete, reading from metadata table could fail w/ below stacktrace. {code:java} 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job 39d720db-b15d-4823-b8b1-54398b143d6e org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176) at org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219) at org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264) at org.apache.hudi.BaseHoodieTableFileIndex.<init>(BaseHoodieTableFileIndex.java:139) at org.apache.hudi.hadoop.HiveHoodieTableFileIndex.<init>(HiveHoodieTableFileIndex.java:49) at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234) at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141) at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217) at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76) at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getSplits(HoodieCombineHiveInputFormat.java:942) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getCombineSplits(HoodieCombineHiveInputFormat.java:241) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getSplits(HoodieCombineHiveInputFormat.java:363) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.RDD.getNumPartitions(RDD.scala:267) at org.apache.spark.api.java.JavaRDDLike$class.getNumPartitions(JavaRDDLike.scala:65) at org.apache.spark.api.java.AbstractJavaRDDLike.getNumPartitions(JavaRDDLike.scala:45) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateMapInput(SparkPlanGenerator.java:252) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:179) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:130) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:355) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:400) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:365) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata at org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:113) at 
org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:313) ... 32 more Caused by: java.util.NoSuchElementException: No value present in Option at org.apache.hudi.common.util.Option.get(Option.java:89) at org.apache.hudi.metadata.HoodieTableMetadataUtil.getPartitionFileSlices(HoodieTableMetadataUtil.java:1057) at org.apache.hudi.metadata.HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(HoodieTableMetadataUtil.java:1001) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getPartitionFileSliceToKeysMapping(HoodieBackedTableMetadata.java:377) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:204) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:140) at org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:281) at
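The root cause in the trace above is an unconditional `Option.get()` in `HoodieTableMetadataUtil.getPartitionFileSlices`, which throws `NoSuchElementException` when the metadata table has no completed commit yet. A minimal, self-contained sketch of that failure mode, using `java.util.Optional` as a stand-in for Hudi's `org.apache.hudi.common.util.Option` (the `latestCompletedInstant` helper below is hypothetical, not Hudi code):

```java
import java.util.List;
import java.util.Optional;

// Sketch of the failure mode: when no commit has completed on the metadata
// table yet, the "latest instant" lookup is empty, and an unconditional
// get() throws NoSuchElementException, surfacing as the stack trace above.
class MetadataInstantLookup {

    // Hypothetical helper: latest completed instant time, if any.
    static Optional<String> latestCompletedInstant(List<String> completed) {
        return completed.isEmpty() ? Optional.empty()
                                   : Optional.of(completed.get(completed.size() - 1));
    }

    // Unsafe: mirrors the failing call site (get() on a possibly empty Option).
    static String unsafeLatest(List<String> completed) {
        return latestCompletedInstant(completed).get(); // throws if nothing completed yet
    }

    // Safe: fall back instead of throwing when no commit has completed yet.
    static String safeLatest(List<String> completed) {
        return latestCompletedInstant(completed).orElse("NONE");
    }
}
```

The fix direction implied by the report is to guard the empty case (or treat an uninitialized metadata table as "no partitions") rather than calling `get()` unconditionally.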
[GitHub] [hudi] wwli05 opened a new issue, #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
wwli05 opened a new issue, #6835: URL: https://github.com/apache/hudi/issues/6835

HoodieRealtimeRecordReader claims to support merge-on-read record reading, but in my test it only returns data from the log file. Looking at RealtimeCompactedRecordReader:

    public boolean next(NullWritable aVoid, ArrayWritable arrayWritable) throws IOException {
      while (this.parquetReader.next(aVoid, arrayWritable)) {
        if (!deltaRecordMap.isEmpty()) {
          String key = arrayWritable.get()[recordKeyIndex].toString();
          if (deltaRecordMap.containsKey(key)) {
            this.deltaRecordKeys.remove(key);
            // 1. this call only reads the record from the log file
            Option rec = buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));
            if (!rec.isPresent()) {
              continue;
            }
            // 2. this call only copies; there is no merge logic
            setUpWritable(rec, arrayWritable, key);
            return true;
          }
        }
        return true;
      }
      return false;
    }

3. So I think Hive does not support merge-on-read record reading now; can someone confirm this?
4. To support MOR reads, shouldn't buildGenericRecordwithCustomPayload pass the current value from the parquet file and invoke combineAndGetUpdateValue instead of getInsertValue? Am I right?

The current buildGenericRecordwithCustomPayload logic:

    private Option buildGenericRecordwithCustomPayload(HoodieRecord record) throws IOException {
      if (usesCustomPayload) {
        return ((HoodieAvroRecord) record).getData().getInsertValue(getWriterSchema(), payloadProps);
      } else {
        return ((HoodieAvroRecord) record).getData().getInsertValue(getReaderSchema(), payloadProps);
      }
    }

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
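The distinction the reporter is asking about can be sketched with a toy payload type. This is not Hudi's real `HoodieRecordPayload` interface (the real one operates on Avro records and schemas); it is a simplified stand-in, with illustrative names and a plain `String` record type, that only shows why calling `getInsertValue` alone ignores the base-file row while `combineAndGetUpdateValue` can merge with it:

```java
import java.util.Optional;

// Simplified stand-in for Hudi's HoodieRecordPayload semantics:
// getInsertValue sees only the log record; combineAndGetUpdateValue
// also receives the existing base-file row and can merge the two.
interface Payload {
    Optional<String> getInsertValue();                     // log record only
    Optional<String> combineAndGetUpdateValue(String old); // merge with base-file row
}

// Toy payload: a null value models a field the log record did not set.
class OverwriteNonNullPayload implements Payload {
    private final String value;

    OverwriteNonNullPayload(String value) { this.value = value; }

    @Override public Optional<String> getInsertValue() {
        return Optional.ofNullable(value);
    }

    @Override public Optional<String> combineAndGetUpdateValue(String old) {
        // Keep the old row's value when the incoming record carries none.
        return Optional.of(value != null ? value : old);
    }
}
```

With `getInsertValue` alone, the base-file value never participates in the result, which matches the behavior the reporter observed in `RealtimeCompactedRecordReader`.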
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6705: [HUDI-4868] Fixed the issue that compaction is invalid when the last commit action is replace commit.
nsivabalan commented on code in PR #6705: URL: https://github.com/apache/hudi/pull/6705#discussion_r984172615

## hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java:

    @@ -214,22 +216,22 @@ public static List getPendingCompactionInstantTimes(HoodieTableMe
     */
    public static Option> getDeltaCommitsSinceLatestCompaction(
        HoodieActiveTimeline activeTimeline) {
    -  Option lastCompaction = activeTimeline.getCommitTimeline()
    +  Option lastCompaction = activeTimeline.getCommitTimeline().filter(s -> !s.getAction().equals(REPLACE_COMMIT_ACTION))
           .filterCompletedInstants().lastInstant();
    -  HoodieTimeline deltaCommits = activeTimeline.getDeltaCommitTimeline();
    +  HoodieTimeline deltaAndReplaceCommits = activeTimeline.getDeltaCommitAndReplaceCommitTimeline();

Review Comment (on the `getDeltaCommitAndReplaceCommitTimeline` change): I am not sure this makes sense. This method `getDeltaCommitsSinceLatestCompaction` only cares about delta commits for the purpose of scheduling compaction, so replace commits should not matter. Can you help me understand why we need to include replace commits here?

Review Comment (on the `REPLACE_COMMIT_ACTION` filter): I agree this fix makes sense.
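The part of the fix the reviewer agrees with can be sketched with a toy timeline in place of `HoodieActiveTimeline`. Clustering puts a `replacecommit` on the commit timeline, so without the added filter the last-compaction lookup can land on a replace commit and undercount the delta commits since the real compaction. The sketch models only the filtering logic, not Hudi's real timeline API:

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Toy model of the timeline filtering under discussion. Instant times are
// lexicographically ordered strings, as in Hudi's timeline.
class TimelineSketch {
    record Instant(String time, String action) {}

    // Last completed compaction: only "commit" actions count; a
    // "replacecommit" (clustering) must not be mistaken for a compaction.
    static Optional<Instant> lastCompaction(List<Instant> timeline) {
        return timeline.stream()
                .filter(i -> i.action().equals("commit")) // excludes "replacecommit"
                .reduce((a, b) -> b);                     // keep the last one
    }

    // Delta commits written after the last compaction.
    static List<Instant> deltaCommitsSince(List<Instant> timeline) {
        String since = lastCompaction(timeline).map(Instant::time).orElse("");
        return timeline.stream()
                .filter(i -> i.action().equals("deltacommit"))
                .filter(i -> i.time().compareTo(since) > 0)
                .collect(Collectors.toList());
    }
}
```

Without the `"commit"`-only filter, a trailing `replacecommit` would become the "last compaction" and the delta commits before it would be silently dropped from the count used to schedule compaction.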
[GitHub] [hudi] hudi-bot commented on pull request #6805: [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row
hudi-bot commented on PR #6805: URL: https://github.com/apache/hudi/pull/6805#issuecomment-1263034687 ## CI report: * 573c27aef34708f1b6019f0647a0ef7093c3a96a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11822) * 075b993b608134f15eff7cab96b8e916369ae722 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11918) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6805: [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row
hudi-bot commented on PR #6805: URL: https://github.com/apache/hudi/pull/6805#issuecomment-1263032260 ## CI report: * 573c27aef34708f1b6019f0647a0ef7093c3a96a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11822) * 075b993b608134f15eff7cab96b8e916369ae722 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6830: [HUDI-2118] Skip checking corrupt log blocks for transactional write file systems
hudi-bot commented on PR #6830: URL: https://github.com/apache/hudi/pull/6830#issuecomment-1263026890 ## CI report: * 6ab358154bb350a68340c9e8b9cafd0de260252c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11897) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11917)
[GitHub] [hudi] hudi-bot commented on pull request #6751: [MINOR] Fixes to make unit tests work on m1
hudi-bot commented on PR #6751: URL: https://github.com/apache/hudi/pull/6751#issuecomment-1263026731 ## CI report: * c7a1d373796e8bfce040bd79a07f68ef6b7ffc59 UNKNOWN * 287c52c6da5eb75093f3c9f7bfd5bfaf0eeb9ac0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11911)
[GitHub] [hudi] hudi-bot commented on pull request #6793: 【HUDI-4917】Optimized the way to get HoodieBaseFile of loadColumnRange…
hudi-bot commented on PR #6793: URL: https://github.com/apache/hudi/pull/6793#issuecomment-1263026777 ## CI report: * 32cc352122d276f5bb5943a0dd420920854fdb8e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11837) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11916)
[GitHub] [hudi] boneanxs commented on a diff in pull request #6793: 【HUDI-4917】Optimized the way to get HoodieBaseFile of loadColumnRange…
boneanxs commented on code in PR #6793: URL: https://github.com/apache/hudi/pull/6793#discussion_r984153227

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java:

    @@ -161,19 +162,19 @@ private List> getBloomIndexFileInfoForPartition
    List> loadColumnRangesFromFiles(
        List partitions, final HoodieEngineContext context, final HoodieTable hoodieTable) {
      // Obtain the latest data files from all the partitions.
    - List> partitionPathFileIDList = getLatestBaseFilesForAllPartitions(partitions, context, hoodieTable).stream()
    -     .map(pair -> Pair.of(pair.getKey(), pair.getValue().getFileId()))
    + List>> partitionPathFileIDList = getLatestBaseFilesForAllPartitions(partitions, context, hoodieTable).stream()
    +     .map(pair -> Pair.of(pair.getKey(), Pair.of(pair.getValue().getFileId(), pair.getValue(
          .collect(toList());
      context.setJobStatus(this.getClass().getName(), "Obtain key ranges for file slices (range pruning=on): " + config.getTableName());
      return context.map(partitionPathFileIDList, pf -> {
        try {
    -     HoodieRangeInfoHandle rangeInfoHandle = new HoodieRangeInfoHandle(config, hoodieTable, pf);
    -     String[] minMaxKeys = rangeInfoHandle.getMinMaxKeys();
    -     return Pair.of(pf.getKey(), new BloomIndexFileInfo(pf.getValue(), minMaxKeys[0], minMaxKeys[1]));
    +     HoodieRangeInfoHandle rangeInfoHandle = new HoodieRangeInfoHandle(config, hoodieTable, Pair.of(pf.getKey(), pf.getValue().getKey()));
    +     String[] minMaxKeys = rangeInfoHandle.getMinMaxKeys(pf.getValue().getValue());

Review Comment: I think `HoodieRangeInfoHandle` is bound to a file slice, but here you break the class's meaning by allowing it to handle different files. Maybe we can change the constructor to accept a `BaseFile`, while keeping the method as it was before.
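The reviewer's suggestion (bind the file in the constructor and keep `getMinMaxKeys()` parameterless) can be sketched with stand-in types. `BaseFileStub` and `RangeInfoHandle` below are illustrative and do not match Hudi's real `HoodieBaseFile`/`HoodieRangeInfoHandle` signatures:

```java
// Stand-in for a base file whose key range is known (in Hudi, the min/max
// record keys would come from the parquet footer, not fields like these).
class BaseFileStub {
    final String fileId;
    final String minKey, maxKey;
    BaseFileStub(String fileId, String minKey, String maxKey) {
        this.fileId = fileId; this.minKey = minKey; this.maxKey = maxKey;
    }
}

// The handle is bound to exactly one file at construction time, so callers
// cannot accidentally query it against a different file per method call.
class RangeInfoHandle {
    private final BaseFileStub baseFile;

    RangeInfoHandle(BaseFileStub baseFile) { this.baseFile = baseFile; }

    // No file parameter: the handle already knows which file it reads.
    String[] getMinMaxKeys() {
        return new String[] { baseFile.minKey, baseFile.maxKey };
    }
}
```

Binding the file once preserves the original class contract (one handle per file slice) while still avoiding the extra file lookup the PR set out to optimize.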
[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46
hudi-bot commented on PR #6745: URL: https://github.com/apache/hudi/pull/6745#issuecomment-1263026678 ## CI report: * f2823f9cfd431f63e8026cd4a4d4680cd842a660 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11910)
[GitHub] [hudi] hudi-bot commented on pull request #6815: [HUDI-4937] Fix `HoodieTable` injecting non-reusable `HoodieBackedTableMetadata` aggressively flushing MT readers
hudi-bot commented on PR #6815: URL: https://github.com/apache/hudi/pull/6815#issuecomment-1263026828 ## CI report: * 12160b8c178ef5bd2721727207c41fdfa2f40e8f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11883) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11912)
[GitHub] [hudi] boneanxs commented on pull request #6793: 【HUDI-4917】Optimized the way to get HoodieBaseFile of loadColumnRange…
boneanxs commented on PR #6793: URL: https://github.com/apache/hudi/pull/6793#issuecomment-1263026464 @hudi-bot run azure
[GitHub] [hudi] giftbowen commented on pull request #6830: [HUDI-2118] Skip checking corrupt log blocks for transactional write file systems
giftbowen commented on PR #6830: URL: https://github.com/apache/hudi/pull/6830#issuecomment-1263021647 @hudi-bot run azure
[GitHub] [hudi] scxwhite commented on issue #6687: [SUPPORT] Poor Upsert Performance on COW table due to indexing
scxwhite commented on issue #6687: URL: https://github.com/apache/hudi/issues/6687#issuecomment-1263019514 You can see how to use these indexes in the [official documents.](https://hudi.apache.org/docs/basic_configurations#index-configs) If you want to know more about bucket index, take a look at this [document](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index).
[GitHub] [hudi] yuzhaojing closed pull request #6823: [Do Not Merge] test for 0.12.1 rc1
yuzhaojing closed pull request #6823: [Do Not Merge] test for 0.12.1 rc1 URL: https://github.com/apache/hudi/pull/6823
[GitHub] [hudi] boneanxs commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive
boneanxs commented on PR #6725: URL: https://github.com/apache/hudi/pull/6725#issuecomment-1263013312 @codope @yihua @alexeykudinkin @xushiyan Hi, could you please take a look at this improvement?
[jira] [Closed] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"
[ https://issues.apache.org/jira/browse/HUDI-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin closed HUDI-4879. - Resolution: Fixed > MERGE INTO fails when setting "hoodie.datasource.write.payload.class" > - > > Key: HUDI-4879 > URL: https://issues.apache.org/jira/browse/HUDI-4879 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Jian Feng >Priority: Blocker > Fix For: 0.12.1 > > > As reported by the user: > [https://github.com/apache/hudi/issues/6354] > > Currently, setting {{hoodie.datasource.write.payload.class = > 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'}} will result in > the following exception: > {code:java} > org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0 at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329) > at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244) > at > org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386) > at > org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498) > at > 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:335) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge > new record with old value in storage, for new record > {HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, > currentLocation='HoodieRecordLocation {instantTime=20220810095846644, > fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', > newLocation='HoodieRecordLocation {instantTime=20220810101719437, > fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}}, old value > {{"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": > "20220810095824514_0_0", "_hoodie_record_key": "id:1", > "_hoodie_partition_path": "", "_hoodie_file_name": > 
"60c04f95-ca5e-4f82-9558-40da29cc022e-0_0-937-24808_20220810095846644.parquet", > "id": 1, "name": "a0", "ts": 1000}} at > org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149) > at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358) > at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349) > at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322) > ... 28 more > Caused by: org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new >
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
alexeykudinkin commented on code in PR #5416: URL: https://github.com/apache/hudi/pull/5416#discussion_r984089070 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -230,6 +240,16 @@ public class HoodieWriteConfig extends HoodieConfig { .defaultValue(String.valueOf(4 * 1024 * 1024)) .withDocumentation("Size of in-memory buffer used for parallelizing network reads and lake storage writes."); + public static final ConfigProperty WRITE_BUFFER_SIZE = ConfigProperty + .key("hoodie.write.buffer.size") + .defaultValue(1024) + .withDocumentation("The size of the Disruptor Executor ring buffer, must be power of 2"); + + public static final ConfigProperty WRITE_WAIT_STRATEGY = ConfigProperty Review Comment: Same as above ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/HoodieExecutor.java: ## @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.util.queue; + +import java.util.concurrent.ExecutorCompletionService; + +public abstract class HoodieExecutor { Review Comment: - Let's convert this to an interface - Please add a java-doc ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/HoodieDaemonThreadFactory.java: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.util.queue; + +import org.jetbrains.annotations.NotNull; +import java.util.concurrent.ThreadFactory; + +public class HoodieDaemonThreadFactory implements ThreadFactory { + + private Runnable preExecuteRunnable; + + public HoodieDaemonThreadFactory(Runnable preExecuteRunnable) { +this.preExecuteRunnable = preExecuteRunnable; Review Comment: Can you help me understand what kind of prologues we're planning to execute here? ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/HoodieDaemonThreadFactory.java: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.util.queue; + +import org.jetbrains.annotations.NotNull; +import java.util.concurrent.ThreadFactory; + +public class HoodieDaemonThreadFactory implements ThreadFactory { + + private Runnable preExecuteRunnable; + + public HoodieDaemonThreadFactory(Runnable preExecuteRunnable) { Review Comment: If we're planning to have a custom factory it's a good idea to add custom name to the threads it produces (for them to be more easily identifiable) ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/DisruptorMessageQueue.java: ## @@
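The reviewer's suggestions above — give the custom `ThreadFactory`'s threads an identifiable name, and note that `hoodie.write.buffer.size` must be a power of two — could look roughly like the sketch below. This is illustrative only: the class and method names are not Hudi's actual implementation, and the prologue semantics are an assumption based on the review thread.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (not Hudi's actual code): a daemon ThreadFactory that
// names its threads with a recognizable prefix and runs a prologue
// (e.g. thread-local setup) on the new thread before each task.
class NamedDaemonThreadFactory implements ThreadFactory {

  private final String namePrefix;
  private final Runnable preExecuteRunnable;
  private final AtomicInteger threadNumber = new AtomicInteger(0);

  NamedDaemonThreadFactory(String namePrefix, Runnable preExecuteRunnable) {
    this.namePrefix = namePrefix;
    this.preExecuteRunnable = preExecuteRunnable;
  }

  @Override
  public Thread newThread(Runnable r) {
    Runnable wrapped = () -> {
      preExecuteRunnable.run(); // prologue runs on the new thread, before the task
      r.run();
    };
    Thread t = new Thread(wrapped, namePrefix + "-" + threadNumber.incrementAndGet());
    t.setDaemon(true); // daemon threads will not block JVM shutdown
    return t;
  }

  // A Disruptor ring-buffer size must be a positive power of two,
  // which is what the WRITE_BUFFER_SIZE documentation above requires.
  static boolean isPowerOfTwo(int n) {
    return n > 0 && (n & (n - 1)) == 0;
  }
}
```

Naming the threads (e.g. `hoodie-disruptor-1`, `hoodie-disruptor-2`) makes them easy to pick out in thread dumps, which is exactly the identifiability concern the review raises.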
[GitHub] [hudi] hudi-bot commented on pull request #6741: [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table
hudi-bot commented on PR #6741: URL: https://github.com/apache/hudi/pull/6741#issuecomment-1262994744 ## CI report: * bff3acafde6d8a1bd5574b90ce644ef30acbf0a2 UNKNOWN * e39d50d6242e272f867c9987a8a2e97ca323568f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11886) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11915) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on pull request #6741: [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table
xiarixiaoyao commented on PR #6741: URL: https://github.com/apache/hudi/pull/6741#issuecomment-1262994330 @hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #6831: [DO NOT MERGE] doing a test
hudi-bot commented on PR #6831: URL: https://github.com/apache/hudi/pull/6831#issuecomment-1262989566 ## CI report: * abde5c46b45518257866a3de7914352920c8c5cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11909)
[GitHub] [hudi] zhengyuan-cn commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn commented on issue #6596: URL: https://github.com/apache/hudi/issues/6596#issuecomment-1262986108 > > I replaced the Impala Hudi dependency jars (hudi-common-0.5.0-incubating.jar, hudi-hadoop-mr-0.5.0-incubating.jar) with (hudi-common-0.12.0.jar, hudi-hadoop-mr-0.12.0.jar); the issue persists. > > > ENV: impala4.0+hive3.1.1 with hudi 0.11 is correct. > > @zhengyuan-cn do you mean you replaced `hudi-*-0.5.0` with `hudi-*-0.11.0` and it worked? > Hi @xushiyan, I debugged in Flink + Hudi local mode and found that CleanPlanner deleted one of my partitions. I have three partitions (2022/09/27, 2022/09/28, 2022/09/29), and CleanPlanner deleted partition '2022/09/27'. Detailed logs below. 307113 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - 1 patterns used to delete in partition path:2022/09/27 307113 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Partition 2022/09/27 to be deleted --- detail log : `306975 [pool-16-thread-1] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - Took 62 ms to read 16 instants, 171 replaced file groups 306998 [pool-16-thread-1] INFO org.apache.hudi.common.util.ClusteringUtils - Found 109 files in pending clustering operations 306998 [pool-16-thread-1] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://192.168.1.75:58989/v1/hoodie/view/compactions/pending/?basepath=hdfs%3A%2F%2Fhadoop01%3A9000%2Fhudi%2Fcow-intact-4=20220929165857714=3446cb10ee80b94e6b37ad4052890146807bbf579bd20bed86c2e7564d09b62d) 307014 [qtp805746605-86] INFO org.apache.hudi.timeline.service.RequestHandler - Syncing view as client passed last known instant 20220929165857714 as last known instant but server has the following last instant on timeline :Option{val=[20220929165857714__commit__COMPLETED]} 307018 [qtp805746605-86] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants upto : Option{val=[==>20220929165927744__commit__INFLIGHT]} 307049 
[qtp805746605-86] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - Took 31 ms to read 16 instants, 171 replaced file groups 307072 [qtp805746605-86] INFO org.apache.hudi.common.util.ClusteringUtils - Found 109 files in pending clustering operations 307078 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Incremental Cleaning mode is enabled. Looking up partition-paths that have since changed since last cleaned at 20220929164457499. New Instant to retain : Option{val=[20220929164559700__replacecommit__COMPLETED]} 307086 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Total Partitions to clean : 3, with policy KEEP_LATEST_COMMITS 307086 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Using cleanerParallelism: 3 307086 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Cleaning 2022/09/27, retaining latest 30 commits. 307086 [ForkJoinPool.commonPool-worker-6] INFO org.apache.hudi.table.action.clean.CleanPlanner - Cleaning 2022/09/28, retaining latest 30 commits. 307086 [ForkJoinPool.commonPool-worker-11] INFO org.apache.hudi.table.action.clean.CleanPlanner - Cleaning 2022/09/29, retaining latest 30 commits. 
307087 [ForkJoinPool.commonPool-worker-6] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://192.168.1.75:58989/v1/hoodie/view/filegroups/replaced/before/?partition=2022%2F09%2F28=20220929164559700=hdfs%3A%2F%2Fhadoop01%3A9000%2Fhudi%2Fcow-intact-4=20220929165857714=3446cb10ee80b94e6b37ad4052890146807bbf579bd20bed86c2e7564d09b62d) 307087 [ForkJoinPool.commonPool-worker-11] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://192.168.1.75:58989/v1/hoodie/view/filegroups/replaced/before/?partition=2022%2F09%2F29=20220929164559700=hdfs%3A%2F%2Fhadoop01%3A9000%2Fhudi%2Fcow-intact-4=20220929165857714=3446cb10ee80b94e6b37ad4052890146807bbf579bd20bed86c2e7564d09b62d) 307087 [pool-16-thread-1] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://192.168.1.75:58989/v1/hoodie/view/filegroups/replaced/before/?partition=2022%2F09%2F27=20220929164559700=hdfs%3A%2F%2Fhadoop01%3A9000%2Fhudi%2Fcow-intact-4=20220929165857714=3446cb10ee80b94e6b37ad4052890146807bbf579bd20bed86c2e7564d09b62d) 307089 [qtp805746605-535] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - Building file system view for partition (2022/09/27) 307090 [qtp805746605-535] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - addFilesToView: NumFiles=3, NumFileGroups=2, FileGroupsCreationTime=0, StoreTimeTaken=0 307093 [qtp805746605-81] INFO
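For context on the log above: the `KEEP_LATEST_COMMITS` policy retains files referenced by the latest N commits and marks older file versions for cleaning. The following sketch shows only the retention boundary calculation; it is an illustration, not Hudi's actual `CleanPlanner`, which additionally accounts for savepoints, pending clustering, and incremental cleaning.

```java
import java.util.List;

// Minimal sketch of the KEEP_LATEST_COMMITS retention boundary
// (illustrative only, not the real org.apache.hudi CleanPlanner logic).
class KeepLatestCommitsSketch {

  // Given commit timestamps sorted ascending, return the earliest commit
  // that must be retained when keeping the latest `retained` commits.
  // File versions only visible to commits older than this boundary are
  // candidates for cleaning; returns null when there are no commits.
  static String earliestCommitToRetain(List<String> sortedCommits, int retained) {
    if (sortedCommits.isEmpty()) {
      return null;
    }
    if (sortedCommits.size() <= retained) {
      // Fewer commits than the retention count: keep everything.
      return sortedCommits.get(0);
    }
    return sortedCommits.get(sortedCommits.size() - retained);
  }
}
```

With "retaining latest 30 commits" as in the log, a partition's files are only cleanable once more than 30 commits have accumulated past the file's last update.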
[GitHub] [hudi] zhengyuan-cn opened a new issue, #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn opened a new issue, #6596: URL: https://github.com/apache/hudi/issues/6596 ENV: impala4.0+hive3.1.1 with hudi 0.12. Executing `select count(*) from tableName;` via the impala shell returns 195264946 rows, fewer than the actual 217884008 rows. Spark SQL returns the correct result of 217884008 rows. I refreshed the table multiple times and still got the incorrect result. I replaced the Impala Hudi dependency jars (hudi-common-0.5.0-incubating.jar, hudi-hadoop-mr-0.5.0-incubating.jar) with (hudi-common-0.12.0.jar, hudi-hadoop-mr-0.12.0.jar); the issue persists. ENV: impala4.0+hive3.1.1 with hudi 0.11 is correct. **Environment Description** * Hudi version : 0.12 * Spark version : spark-2.4.8 * Hive version : 3.1.1 (the one bundled with Impala) * Hadoop version : hadoop-3.2.2 * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : no Additional : Impala: `[192.168.1.52:21000] hudi> refresh model_series_data_3; Connection lost, reconnecting... Opened TCP connection to 192.168.1.52:21000 Query: use `hudi` Query: refresh model_series_data_3 Query submitted at: 2022-09-05 07:07:44 (Coordinator: http://192.168.10.52:25000) Query progress can be monitored at: http://192.168.1.52:25000/query_plan?query_id=b34a6e2e71c0af91:2521ad2d Fetched 0 row(s) in 0.28s [192.168.1.52:21000] hudi> select count(*) from model_series_data_3; Query: select count(*) from model_series_data_3 Query submitted at: 2022-09-05 07:07:46 (Coordinator: http://192.168.10.52:25000) Query progress can be monitored at: http://192.168.1.52:25000/query_plan?query_id=f848080d361104ad:ebb3af9a +---+ | count(*) | +---+ | 195264946 | +---+ Fetched 1 row(s) in 2.72s` == Spark : `+-+ | count(1)| +-+ |217884008| +-+ 16:30:59,796 INFO AbstractConnector:381 - Stopped Spark@47da3952{HTTP/1.1, (http/1.1)}{0.0.0.0:4040} 16:30:59,797 INFO SparkUI:54 - Stopped Spark web UI at http://192.168.2.56:4040`
[GitHub] [hudi] zhengyuan-cn commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn commented on issue #6596: URL: https://github.com/apache/hudi/issues/6596#issuecomment-1262984003 > @zhengyuan-cn do you mean you replaced `hudi-*-0.5.0` with `hudi-*-0.11.0` and it worked? No — the environment (impala4.0+hive3.1.1 with hudi 0.11) worked, and the result is correct.
[GitHub] [hudi] zhengyuan-cn closed issue #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn closed issue #6596: [SUPPORT] with Impala 4.0 Records lost URL: https://github.com/apache/hudi/issues/6596
[GitHub] [hudi] hudi-bot commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider
hudi-bot commented on PR #6817: URL: https://github.com/apache/hudi/pull/6817#issuecomment-1262945708 ## CI report: * e1589ebfa7aea943040a85de3b93a4613b365d83 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11908)
[GitHub] [hudi] nsivabalan merged pull request #6355: [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand
nsivabalan merged PR #6355: URL: https://github.com/apache/hudi/pull/6355
[GitHub] [hudi] hudi-bot commented on pull request #6665: [HUDI-4850] Incremental Ingestion from GCS
hudi-bot commented on PR #6665: URL: https://github.com/apache/hudi/pull/6665#issuecomment-1262893169 ## CI report: * 4864b65515d6e9ea5b6ba9d83241cfc310cbf3ee UNKNOWN * 5ed92a20666863315f41578a905dd6f2681a1363 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11907)
[GitHub] [hudi] hudi-bot commented on pull request #6575: [HUDI-4754] Add compliance check in github actions
hudi-bot commented on PR #6575: URL: https://github.com/apache/hudi/pull/6575#issuecomment-1262892850 ## CI report: * 1600e31836157c8d05e3bc8b9e08e1717471f1a6 UNKNOWN * 4d02f2c64a5fc4b89889677ee639a20b53cec26a UNKNOWN * 48147d19c835e7868102fd2d083659e6ee2ac343 UNKNOWN * b524fcc1dc3a5ce4d32a1238e09b9cc58b3e26b6 UNKNOWN * 3f2440a00e10b2c2daa4d930fd2933d48f5be1a2 UNKNOWN * 5dfc76a457a1ef80cc87d35a2bd24bab01edfd5b UNKNOWN * 51979ee5abe5df950a320e0b0ba02532c589432d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11906)
[hudi] branch master updated: [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (#6355)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 15ca7a3060 [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (#6355) 15ca7a3060 is described below commit 15ca7a306058c5d8c708b5310cb92f213f8d5834 Author: 冯健 AuthorDate: Fri Sep 30 06:34:00 2022 +0800 [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (#6355) Co-authored-by: jian.feng --- .../hudi/command/MergeIntoHoodieTableCommand.scala | 6 ++-- .../spark/sql/hudi/TestMergeIntoTable2.scala | 40 +- 2 files changed, 42 insertions(+), 4 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala index 2761a00205..f0394ad379 100644 --- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala +++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala @@ -509,7 +509,8 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie val targetTableDb = targetTableIdentify.database.getOrElse("default") val targetTableName = targetTableIdentify.identifier val path = hoodieCatalogTable.tableLocation -val catalogProperties = hoodieCatalogTable.catalogProperties +// force to use ExpressionPayload as WRITE_PAYLOAD_CLASS_NAME in MergeIntoHoodieTableCommand +val catalogProperties = hoodieCatalogTable.catalogProperties + (PAYLOAD_CLASS_NAME.key -> classOf[ExpressionPayload].getCanonicalName) val tableConfig = hoodieCatalogTable.tableConfig val tableSchema = hoodieCatalogTable.tableSchema val partitionColumns = 
tableConfig.getPartitionFieldProp.split(",").map(_.toLowerCase) @@ -523,14 +524,13 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie val hoodieProps = getHoodieProps(catalogProperties, tableConfig, sparkSession.sqlContext.conf) val hiveSyncConfig = buildHiveSyncConfig(hoodieProps, hoodieCatalogTable) -withSparkConf(sparkSession, hoodieCatalogTable.catalogProperties) { +withSparkConf(sparkSession, catalogProperties) { Map( "path" -> path, RECORDKEY_FIELD.key -> tableConfig.getRecordKeyFieldProp, PRECOMBINE_FIELD.key -> preCombineField, TBL_NAME.key -> hoodieCatalogTable.tableName, PARTITIONPATH_FIELD.key -> tableConfig.getPartitionFieldProp, -PAYLOAD_CLASS_NAME.key -> classOf[ExpressionPayload].getCanonicalName, HIVE_STYLE_PARTITIONING.key -> tableConfig.getHiveStylePartitioningEnable, URL_ENCODE_PARTITIONING.key -> tableConfig.getUrlEncodePartitioning, KEYGENERATOR_CLASS_NAME.key -> classOf[SqlKeyGenerator].getCanonicalName, diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala index 8e6acd1be5..8a6aa9691d 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala @@ -674,7 +674,7 @@ class TestMergeIntoTable2 extends HoodieSparkSqlTestBase { } } - test ("Test Merge into with String cast to Double") { + test("Test Merge into with String cast to Double") { withTempDir { tmp => val tableName = generateTableName // Create a cow partitioned table. 
@@ -713,4 +713,42 @@ class TestMergeIntoTable2 extends HoodieSparkSqlTestBase { ) } } + + test("Test Merge into where manually set DefaultHoodieRecordPayload") { +withTempDir { tmp => + val tableName = generateTableName + // Create a cow table with default payload class, check whether it will be overwritten by ExpressionPayload. + // if not, this ut cannot pass since DefaultHoodieRecordPayload can not promotion int to long when insert a ts with Integer value + spark.sql( +s""" + | create table $tableName ( + | id int, + | name string, + | ts long + | ) using hudi + | tblproperties ( + | type = 'cow', + | primaryKey = 'id', + | preCombineField = 'ts', + | hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload' + | ) location '${tmp.getCanonicalPath}' + """.stripMargin) + // Insert data + spark.sql(s"insert into $tableName
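The key change in the commit above is appending the payload class to `catalogProperties` (`hoodieCatalogTable.catalogProperties + (PAYLOAD_CLASS_NAME.key -> classOf[ExpressionPayload].getCanonicalName)`), so that a user-supplied `hoodie.datasource.write.payload.class` is always overwritten before the MERGE INTO write is issued. Scala's `map + (key -> value)` replaces any existing entry for the key, which is what the following Java sketch mimics; the `ExpressionPayload` class-name string and the helper method here are illustrative assumptions, not the actual code path.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the override in MergeIntoHoodieTableCommand: a later put() for the
// same key wins, so the user's payload class is replaced by ExpressionPayload.
class PayloadOverrideSketch {

  static final String PAYLOAD_CLASS_NAME = "hoodie.datasource.write.payload.class";

  static Map<String, String> forceExpressionPayload(Map<String, String> catalogProperties) {
    Map<String, String> merged = new HashMap<>(catalogProperties);
    // Assumed canonical name of the Spark-SQL expression payload class.
    merged.put(PAYLOAD_CLASS_NAME,
        "org.apache.spark.sql.hudi.command.payload.ExpressionPayload");
    return merged;
  }
}
```

This is why the new test in the diff can create a table with `DefaultHoodieRecordPayload` in its tblproperties and still expect the MERGE INTO to succeed: the command-level override takes precedence over the table property.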
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6680: [HUDI-4812] lazy fetching partition path & file slice for HoodieFileIndex
alexeykudinkin commented on code in PR #6680: URL: https://github.com/apache/hudi/pull/6680#discussion_r984047423 ## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ## @@ -179,15 +197,125 @@ public void close() throws Exception { } protected List getAllQueryPartitionPaths() { +if (cachedAllPartitionPaths != null) { + return cachedAllPartitionPaths; +} + +loadAllQueryPartitionPaths(); +return cachedAllPartitionPaths; + } + + private void loadAllQueryPartitionPaths() { List queryRelativePartitionPaths = queryPaths.stream() .map(path -> FSUtils.getRelativePartitionPath(basePath, path)) .collect(Collectors.toList()); -// Load all the partition path from the basePath, and filter by the query partition path. -// TODO load files from the queryRelativePartitionPaths directly. -List matchedPartitionPaths = getAllPartitionPathsUnchecked() -.stream() -.filter(path -> queryRelativePartitionPaths.stream().anyMatch(path::startsWith)) +this.cachedAllPartitionPaths = listQueryPartitionPaths(queryRelativePartitionPaths); + +// If the partition value contains InternalRow.empty, we query it as a non-partitioned table. +this.queryAsNonePartitionedTable = this.cachedAllPartitionPaths.stream().anyMatch(p -> p.values.length == 0); Review Comment: We don't need this field anymore we can use `isPartitionedTable` method ## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ## @@ -179,15 +197,125 @@ public void close() throws Exception { } protected List getAllQueryPartitionPaths() { +if (cachedAllPartitionPaths != null) { + return cachedAllPartitionPaths; +} + +loadAllQueryPartitionPaths(); +return cachedAllPartitionPaths; + } + + private void loadAllQueryPartitionPaths() { List queryRelativePartitionPaths = queryPaths.stream() .map(path -> FSUtils.getRelativePartitionPath(basePath, path)) .collect(Collectors.toList()); -// Load all the partition path from the basePath, and filter by the query partition path. 
-// TODO load files from the queryRelativePartitionPaths directly. -List matchedPartitionPaths = getAllPartitionPathsUnchecked() -.stream() -.filter(path -> queryRelativePartitionPaths.stream().anyMatch(path::startsWith)) +this.cachedAllPartitionPaths = listQueryPartitionPaths(queryRelativePartitionPaths); + +// If the partition value contains InternalRow.empty, we query it as a non-partitioned table. +this.queryAsNonePartitionedTable = this.cachedAllPartitionPaths.stream().anyMatch(p -> p.values.length == 0); + } + + protected Map> getAllInputFileSlices() { +if (!isAllInputFileSlicesCached) { + doRefresh(); +} +return cachedAllInputFileSlices; + } + + /** + * Get input file slice for the given partition. Will use cache directly if it is computed before. + */ + protected List getCachedInputFileSlices(PartitionPath partition) { +return cachedAllInputFileSlices.computeIfAbsent(partition, this::loadFileSlicesForPartition); + } + + private List loadFileSlicesForPartition(PartitionPath p) { +FileStatus[] files = loadPartitionPathFiles(p); +HoodieTimeline activeTimeline = getActiveTimeline(); +Option latestInstant = activeTimeline.lastInstant(); + +HoodieTableFileSystemView fileSystemView = new HoodieTableFileSystemView(metaClient, activeTimeline, files); + +Option queryInstant = specifiedQueryInstant.or(() -> latestInstant.map(HoodieInstant::getTimestamp)); + +validate(activeTimeline, queryInstant); + +List ret; +if (tableType.equals(HoodieTableType.MERGE_ON_READ) && queryType.equals(HoodieTableQueryType.SNAPSHOT)) { + ret = queryInstant.map(instant -> + fileSystemView.getLatestMergedFileSlicesBeforeOrOn(p.path, queryInstant.get()) + .collect(Collectors.toList()) + ) + .orElse(Collections.emptyList()); +} else { + ret = queryInstant.map(instant -> + fileSystemView.getLatestFileSlicesBeforeOrOn(p.path, instant, true) + ) + .orElse(fileSystemView.getLatestFileSlices(p.path)) + .collect(Collectors.toList()); +} + +cachedFileSize += 
ret.stream().mapToLong(BaseHoodieTableFileIndex::fileSliceSize).sum(); +return ret; + } + + /** + * Get partition path with the given partition value + * @param partitionNames partition names + * @param values partition values + * @return partitions that match the given partition values + */ + protected List getPartitionPaths(String[] partitionNames, String[] values) { +if (partitionNames.length == 0 || partitionNames.length != values.length) { Review Comment: Let's actually extract composing of the relative partition path (from values) into a standalone method. Then we can get eliminate this one and then just do:
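The `getCachedInputFileSlices` method quoted above leans on `Map.computeIfAbsent` so each partition's file slices are loaded at most once, on first access. A standalone illustration of that lazy per-key caching pattern follows; the string-valued loader is a stand-in for `loadFileSlicesForPartition`, not the real listing logic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Lazy per-key cache in the style of getCachedInputFileSlices:
// the loader runs only on the first get() for a given partition.
class LazyPartitionCache {

  private final Map<String, String> cache = new ConcurrentHashMap<>();
  final AtomicInteger loads = new AtomicInteger(0); // counts expensive loads

  // Stand-in for loadFileSlicesForPartition: an expensive listing, done once per key.
  private String load(String partition) {
    loads.incrementAndGet();
    return "slices-of-" + partition;
  }

  String get(String partition) {
    return cache.computeIfAbsent(partition, this::load);
  }
}
```

With `ConcurrentHashMap`, `computeIfAbsent` also guarantees the loader runs at most once per key even under concurrent access, which is what makes this pattern safe for a shared file index.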
[hudi] branch asf-site updated: [DOCS] Add new blogs (#6833)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 001100a4ed [DOCS] Add new blogs (#6833) 001100a4ed is described below commit 001100a4ed468aa7f384b426b0ba979a00734227 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Thu Sep 29 15:15:49 2022 -0700 [DOCS] Add new blogs (#6833) --- README.md | 1 + ...plementation-of-SCD-2-with-Apache-Hudi-and-Spark.mdx | 17 + ...-Data-Lake-Table-Formats-Delta-Lake-Iceberg-Hudi.mdx | 17 + 3 files changed, 35 insertions(+) diff --git a/README.md b/README.md index f688f75f15..938621247a 100644 --- a/README.md +++ b/README.md @@ -186,6 +186,7 @@ Take a look at this blog for reference - (Apache Hudi vs Delta Lake vs Apache Ic - use-case (some community users talking about their use-case) - design (technical articles talking about Hudi internal design/impl) - performance (involves performance related blogs) + - blog (anything else such as announcements/release updates/insights/guides/tutorials/concepts overview etc) 2. tag 2 - Represent individual features - clustering, compaction, ingestion, meta-sync etc. 3. 
tag 3

diff --git a/website/blog/2022-08-24-Implementation-of-SCD-2-with-Apache-Hudi-and-Spark.mdx b/website/blog/2022-08-24-Implementation-of-SCD-2-with-Apache-Hudi-and-Spark.mdx
new file mode 100644
index 00..e876ab202e
--- /dev/null
+++ b/website/blog/2022-08-24-Implementation-of-SCD-2-with-Apache-Hudi-and-Spark.mdx
@@ -0,0 +1,17 @@
+---
+title: "Implementation of SCD-2 (Slowly Changing Dimension) with Apache Hudi & Spark"
+authors:
+- name: Jayasheel Kalgal
+- name: Esha Dhing
+- name: Prashant Mishra
+category: blog
+image: /assets/images/blog/2022-08-24_implementation_of_scd_2_with_hudi_and_spark.jpeg
+tags:
+- use-case
+- scd2
+- walmartglobaltech
+---
+
+import Redirect from '@site/src/components/Redirect';
+
+<Redirect url="https://medium.com/walmartglobaltech/implementation-of-scd-2-slowly-changing-dimension-with-apache-hudi-465e0eb94a5">Redirecting... please wait!!</Redirect>

diff --git a/website/blog/2022-09-20-Data-Lake-Lakehouse-Guide-Powered-by-Data-Lake-Table-Formats-Delta-Lake-Iceberg-Hudi.mdx b/website/blog/2022-09-20-Data-Lake-Lakehouse-Guide-Powered-by-Data-Lake-Table-Formats-Delta-Lake-Iceberg-Hudi.mdx
new file mode 100644
index 00..4a67b2337d
--- /dev/null
+++ b/website/blog/2022-09-20-Data-Lake-Lakehouse-Guide-Powered-by-Data-Lake-Table-Formats-Delta-Lake-Iceberg-Hudi.mdx
@@ -0,0 +1,17 @@
+---
+title: "Building Streaming Data Lakes with Hudi and MinIO"
+authors:
+- name: Matt Sarrel
+category: blog
+image: /assets/images/blog/2022-09-20_streaming_data_lakes_with_hudi_and_minio.png
+tags:
+- how-to
+- datalake
+- datalake-platform
+- streaming ingestion
+- minio
+---
+
+import Redirect from '@site/src/components/Redirect';
+
+<Redirect url="https://blog.min.io/streaming-data-lakes-hudi-minio/">Redirecting... please wait!!</Redirect>
[GitHub] [hudi] yihua merged pull request #6833: [DOCS] Add new blogs
yihua merged PR #6833: URL: https://github.com/apache/hudi/pull/6833 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua opened a new pull request, #6834: [DOCS] Add 1.0.0 release entry to Roadmap
yihua opened a new pull request, #6834: URL: https://github.com/apache/hudi/pull/6834 ### Change Logs As above. ### Impact **Risk level: none** The website can be built and visualized. ### Documentation Update N/A. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] bhasudha commented on pull request #6833: [DOCS] Add new blogs
bhasudha commented on PR #6833: URL: https://github.com/apache/hudi/pull/6833#issuecomment-1262862834 Screenshot attached from local testing: https://user-images.githubusercontent.com/2179254/193150303-4a14718d-12aa-42d9-9d7d-13a4a011b385.png
[GitHub] [hudi] bhasudha opened a new pull request, #6833: [DOCS] Add new blogs
bhasudha opened a new pull request, #6833: URL: https://github.com/apache/hudi/pull/6833 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] yihua commented on a diff in pull request #5113: [HUDI-3625] [RFC-60] Optimized storage layout for Cloud Object Stores
yihua commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r984045645

## rfc/rfc-56/rfc-56.md:
## @@ -0,0 +1,226 @@
+
+# RFC-56: Federated Storage Layer
+
+## Proposers
+- @umehrot2
+
+## Approvers
+- @vinoth
+- @shivnarayan
+
+## Status
+
+JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625)
+
+## Abstract
+
+As you scale your Apache Hudi workloads over cloud object stores like Amazon S3, there is the potential of hitting request
+throttling limits, which in turn impacts performance. In this RFC, we are proposing to support an alternate storage
+layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and
+significantly reduce throttling.
+
+In addition, we are proposing an interface that would allow users to implement their own custom strategy to
+distribute the data files across cloud stores, HDFS, or on-prem storage based on their specific use cases.
+
+## Background
+
+Apache Hudi follows the traditional Hive storage layout while writing files on storage:
+- Partitioned tables: the files are distributed across multiple physical partition folders, under the table's base path.
+- Non-partitioned tables: the files are stored directly under the table's base path.
+
+While this storage layout scales well for HDFS, it increases the probability of hitting request throttle limits when
+working with cloud object stores like Amazon S3 and others. This is because Amazon S3 and other cloud stores [throttle
+requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+Amazon S3 does scale based on request patterns for different prefixes and adds internal partitions (with their own request limits),
+but there can be a 30-60 minute wait time before new partitions are created. Thus, all files/objects stored under the
+same table path prefix could result in these request limits being hit for the table prefix, especially as workloads
+scale and there are several thousands of files being written/updated concurrently. This hurts performance, since
+retrying failed requests reduces throughput, and results in occasional failures if the retries cannot succeed
+either and continue to be throttled.
+
+The traditional storage layout also tightly couples the partitions as folders under the table path. However,
+some users want the flexibility to distribute files/partitions under multiple different paths across cloud stores,
+HDFS, etc. based on their specific needs. For example, customers have use cases that distribute files for each partition under
+a separate S3 bucket with its individual encryption key. It is not possible to implement such use cases with Hudi currently.
+
+The high-level proposal here is to introduce a new storage layout strategy, where all files are distributed evenly across
+multiple randomly generated prefixes under the Amazon S3 bucket, instead of being stored under a common table path/prefix.
+This would help distribute the requests evenly across different prefixes, resulting in Amazon S3 creating partitions for
+the prefixes, each with its own request limit. This significantly reduces the possibility of hitting the request limit
+for a specific prefix/partition.
+
+In addition, we want to expose an interface that provides users the flexibility to implement their own strategy for
+distributing files if using the traditional Hive storage layout or the federated storage layer (proposed in this RFC) does
+not meet their use case.
+
+## Design
+
+### Interface
+
+```java
+/**
+ * Interface for providing storage file locations.
+ */
+public interface FederatedStorageStrategy extends Serializable {
+  /**
+   * Return a fully-qualified storage file location for the given filename.
+   *
+   * @param fileName data file name
+   * @return a fully-qualified location URI for a data file
+   */
+  String storageLocation(String fileName);
+
+  /**
+   * Return a fully-qualified storage file location for the given partition and filename.
+   *
+   * @param partitionPath partition path for the file
+   * @param fileName data file name
+   * @return a fully-qualified location URI for a data file
+   */
+  String storageLocation(String partitionPath, String fileName);
+}
+```
+
+### Generating file paths for Cloud storage optimized layout
+
+We want to distribute files evenly across multiple random prefixes, instead of following the traditional Hive storage
+layout of keeping them under a common table path/prefix. In addition to the `Table Path`, for this new layout the user will
+configure another `Table Storage Path` under which the actual data files will be distributed. The original `Table Path` will
+be used to maintain the table/partitions Hudi metadata.
+
+For the purpose of this documentation let's assume:
+```
+Table Path => s3:
+
+Table
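To make the quoted `FederatedStorageStrategy` interface concrete, here is a hedged sketch of one possible implementation: files are spread under a configured storage path by hashing the file name into one of N fixed prefixes. Note the RFC proposes randomly generated prefixes recorded in metadata; a deterministic hash is used here only so the sketch is self-contained. `HashedPrefixStorageStrategy` and its prefix scheme are illustrative, not part of the RFC.

```java
import java.io.Serializable;

// The interface proposed in the RFC (reproduced for a self-contained sketch).
interface FederatedStorageStrategy extends Serializable {
  String storageLocation(String fileName);
  String storageLocation(String partitionPath, String fileName);
}

// Illustrative strategy: distribute files across a fixed set of hex prefixes
// under a storage base path, so requests spread over multiple object-store
// prefixes instead of a single common table prefix.
class HashedPrefixStorageStrategy implements FederatedStorageStrategy {
  private final String storageBasePath;  // e.g. "s3://bucket/table-storage" (assumed layout)
  private final int numPrefixes;

  HashedPrefixStorageStrategy(String storageBasePath, int numPrefixes) {
    this.storageBasePath = storageBasePath;
    this.numPrefixes = numPrefixes;
  }

  @Override
  public String storageLocation(String fileName) {
    return String.format("%s/%04x/%s", storageBasePath, prefixFor(fileName), fileName);
  }

  @Override
  public String storageLocation(String partitionPath, String fileName) {
    return String.format("%s/%04x/%s/%s", storageBasePath, prefixFor(fileName), partitionPath, fileName);
  }

  // Stable hash of the file name, folded into [0, numPrefixes).
  private int prefixFor(String fileName) {
    return Math.floorMod(fileName.hashCode(), numPrefixes);
  }
}
```

Because the same file name always hashes to the same prefix, a reader can recompute a file's location without extra metadata; truly random prefixes, as the RFC proposes, instead require the prefix mapping to be tracked in the table's metadata.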
[GitHub] [hudi] yihua commented on a diff in pull request #5113: [HUDI-3625] [RFC-60] Optimized storage layout for Cloud Object Stores
yihua commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r984024879 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,226 @@ + String storageLocation(String partitionPath, String fileName); Review Comment: What does the `fileName` refer to here? Is it the logical file name of a base or log file in a Hudi file slice? And is this relative or absolute?
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1262834872 ## CI report: * 288d166c49602a4593b1e97763a467811903737d UNKNOWN * ae59f6f918a5a08535b73be5c3fc2f29f5e84fb9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11879) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11913) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6355: [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand
hudi-bot commented on PR #6355: URL: https://github.com/apache/hudi/pull/6355#issuecomment-1262834668 ## CI report: * 51fe330035a595e4d65cdf58554077ed0916fd25 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11905)
[GitHub] [hudi] hudi-bot commented on pull request #6815: [HUDI-4937] Fix `HoodieTable` injecting non-reusable `HoodieBackedTableMetadata` aggressively flushing MT readers
hudi-bot commented on PR #6815: URL: https://github.com/apache/hudi/pull/6815#issuecomment-1262827821 ## CI report: * 12160b8c178ef5bd2721727207c41fdfa2f40e8f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11883) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11912)
[hudi] branch asf-site updated: [DOCS] Add images for new blogs
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 8260a6882c [DOCS] Add images for new blogs 8260a6882c is described below commit 8260a6882ca80d9995bba1880f5668576f966043 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Thu Sep 29 14:04:38 2022 -0700 [DOCS] Add images for new blogs --- ...24_implementation_of_scd_2_with_hudi_and_spark.jpeg | Bin 0 -> 183751 bytes ...-09-20_streaming_data_lakes_with_hudi_and_minio.png | Bin 0 -> 213834 bytes 2 files changed, 0 insertions(+), 0 deletions(-) diff --git a/website/static/assets/images/blog/2022-08-24_implementation_of_scd_2_with_hudi_and_spark.jpeg b/website/static/assets/images/blog/2022-08-24_implementation_of_scd_2_with_hudi_and_spark.jpeg new file mode 100644 index 00..deb165ec78 Binary files /dev/null and b/website/static/assets/images/blog/2022-08-24_implementation_of_scd_2_with_hudi_and_spark.jpeg differ diff --git a/website/static/assets/images/blog/2022-09-20_streaming_data_lakes_with_hudi_and_minio.png b/website/static/assets/images/blog/2022-09-20_streaming_data_lakes_with_hudi_and_minio.png new file mode 100644 index 00..364979dc31 Binary files /dev/null and b/website/static/assets/images/blog/2022-09-20_streaming_data_lakes_with_hudi_and_minio.png differ
[GitHub] [hudi] alexeykudinkin commented on issue #6758: [SUPPORT] Will metatable support partitions inside col_stat & files?
alexeykudinkin commented on issue #6758: URL: https://github.com/apache/hudi/issues/6758#issuecomment-1262814782 @Zhangshunyu we're able to do this filtering even w/o physical partitioning (thanks to relying on HFile and an elaborate key encoding scheme) -- we only read the records corresponding to files (in case of Column Stats) pertaining to a particular partition.
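The comment above can be illustrated with a hedged analogy (this is not Hudi's actual metadata-table key scheme): when index records use keys that lead with an encoded partition, a sorted key-value layout such as HFile's lets a reader fetch one partition's entries with a prefix range scan instead of scanning everything. A `TreeMap` stands in for the sorted file below; the `\u0000` separator and key layout are assumptions made for the sketch.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: a sorted map stands in for HFile's sorted key layout. Keys lead
// with the partition, so one partition's column-stats entries are contiguous
// and can be read via a range scan bounded by the partition prefix.
class ColumnStatsIndexSketch {
  private final NavigableMap<String, String> sorted = new TreeMap<>();

  void put(String partition, String file, String column, String stats) {
    // Hypothetical key encoding: partition, then file, then column name.
    sorted.put(partition + "\u0000" + file + "\u0000" + column, stats);
  }

  // Range scan touching exactly the keys that share the partition prefix.
  NavigableMap<String, String> statsForPartition(String partition) {
    String lo = partition + "\u0000";
    String hi = partition + "\u0001";  // just above every key in this partition
    return sorted.subMap(lo, true, hi, false);
  }
}
```

The point of the analogy is that pruning by partition does not require physical partitioning of the index itself, only that keys sort so a partition's records are adjacent.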
[GitHub] [hudi] alexeykudinkin commented on pull request #6815: [HUDI-4937] Fix `HoodieTable` injecting non-reusable `HoodieBackedTableMetadata` aggressively flushing MT readers
alexeykudinkin commented on PR #6815: URL: https://github.com/apache/hudi/pull/6815#issuecomment-1262808073 @hudi-bot run azure
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6805: [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row
alexeykudinkin commented on code in PR #6805: URL: https://github.com/apache/hudi/pull/6805#discussion_r984015580 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala: ## @@ -516,7 +515,7 @@ class HoodieCDCRDD( val iter = loadFileSlice(fileSlice) iter.foreach { row => val key = getRecordKey(row) - beforeImageRecords.put(key, serialize(row)) + beforeImageRecords.put(key, serialize(row, copy = true)) Review Comment: Let's add a comment explaining why we're copying here (to avoid confusion)
[GitHub] [hudi] alexeykudinkin commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
alexeykudinkin commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1262807900 @hudi-bot run azure
[GitHub] [hudi] nochimow commented on issue #6811: [SUPPORT] Slow upsert performance
nochimow commented on issue #6811: URL: https://github.com/apache/hudi/issues/6811#issuecomment-1262804768 Hi @nsivabalan, ~97% of the data should be inserts and the remaining are updates. The updates only touch the latest partitions (-1 day at max). No, we are not setting any small-file config in this case. Based on that, is there any tweak suggestion to decrease the index tagging stage?
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3204: -- Description: Currently, b/c Spark by default omits partition values from the data files (instead encoding them into partition paths for partitioned tables), using `TimestampBasedKeyGenerator` w/ an original timestamp-based column makes it impossible to retrieve the original value (reading from Spark) even though it's persisted in the data file as well.
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.hive.MultiPartKeysValueExtractor

val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")

// mor
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_mor")

|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name   | id|name|age| ts| data_date|
|  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
|  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|

// can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
// still can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show

// cow
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_cow")

|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name   | id|name|age| ts| data_date|
|  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
|  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|

// can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show
// but 2018/09/24 works
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
{code}
was:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import
[GitHub] [hudi] alexeykudinkin commented on pull request #6355: [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand
alexeykudinkin commented on PR #6355: URL: https://github.com/apache/hudi/pull/6355#issuecomment-1262804005 CI is green: https://user-images.githubusercontent.com/428277/193139753-763ed18d-ee41-4e29-9eab-850c05f99912.png https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11905=results
[GitHub] [hudi] alexeykudinkin commented on issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
alexeykudinkin commented on issue #6798: URL: https://github.com/apache/hudi/issues/6798#issuecomment-1262803683 @sstimmel this is a known issue due to how Spark treats partition columns (by default, Spark doesn't persist them in the data files, but instead encodes them into the partition path). Since we're relying on some of the Spark infra to read the data, to make sure that Hudi's tables are compatible w/ Spark execution engines' optimizations, we're unfortunately strangled by these limitations currently, but we're actively looking for solutions there. You can find more details in HUDI-3204
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3204:
----------------------------------
    Summary: Allow original partition column value to be retrieved when using TimestampBasedKeyGen  (was: spark on TimestampBasedKeyGenerator has no result when query by partition column)

> Allow original partition column value to be retrieved when using
> TimestampBasedKeyGen
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-3204
>                 URL: https://issues.apache.org/jira/browse/HUDI-3204
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>            Reporter: Yann Byron
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: hudi-on-call, pull-request-available, sev:critical
>             Fix For: 0.12.1
>
>   Original Estimate: 3h
>          Time Spent: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
>
> val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")
>
> // mor
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
>   option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_mor")
>
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
>
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
>
> // still can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show
>
> // cow
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
>   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_cow")
>
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
> |  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> {code}
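The root of the repro above is a plain date-format rewrite: TimestampBasedKeyGenerator parses `data_date` with the configured input format and re-renders it with the output format, so the stored partition path (and, in the COW output above, the surfaced `data_date` value) no longer equals the value the writer supplied, and the predicate on the original value matches nothing. A minimal `java.time` sketch of that mismatch, independent of Hudi (the class name `PartitionFormatMismatch` is ours, for illustration only):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionFormatMismatch {
    public static void main(String[] args) {
        // Formats mirroring the key generator configs in the repro:
        // input.dateformat = "yyyy-MM-dd", output.dateformat = "yyyy/MM/dd".
        DateTimeFormatter input  = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        DateTimeFormatter output = DateTimeFormatter.ofPattern("yyyy/MM/dd");

        String original = "2018-09-24";

        // Parse with the input format, re-render with the output format,
        // which is what the key generator does to build the partition path.
        String partitionValue = LocalDate.parse(original, input).format(output);

        // The rewritten value differs from the original column value, so a
        // filter on data_date = '2018-09-24' cannot match it.
        System.out.println(partitionValue);                  // 2018/09/24
        System.out.println(partitionValue.equals(original)); // false
    }
}
```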
[jira] [Updated] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"
[ https://issues.apache.org/jira/browse/HUDI-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4879:
----------------------------------
    Reviewers: Alexey Kudinkin

> MERGE INTO fails when setting "hoodie.datasource.write.payload.class"
> ---------------------------------------------------------------------
>
>                 Key: HUDI-4879
>                 URL: https://issues.apache.org/jira/browse/HUDI-4879
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Jian Feng
>            Priority: Blocker
>             Fix For: 0.12.1
>
> As reported by the user:
> https://github.com/apache/hudi/issues/6354
>
> Currently, setting {{hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'}} will result in the following exception:
> {code:java}
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
>     at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>     at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
>     at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498)
>     at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
>     at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
>     at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
>     at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new record with old value in storage, for new record HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, currentLocation='HoodieRecordLocation {instantTime=20220810095846644, fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', newLocation='HoodieRecordLocation {instantTime=20220810101719437, fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}, old value {"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": "20220810095824514_0_0", "_hoodie_record_key": "id:1", "_hoodie_partition_path": "", "_hoodie_file_name": "60c04f95-ca5e-4f82-9558-40da29cc022e-0_0-937-24808_20220810095846644.parquet", "id": 1, "name": "a0", "ts": 1000}
>     at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>     ... 28 more
> Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to