[GitHub] [hudi] felixYyu commented on a diff in pull request #5064: [HUDI-3654] Add new module `hudi-metaserver`
felixYyu commented on code in PR #5064: URL: https://github.com/apache/hudi/pull/5064#discussion_r984168332

## hudi-metaserver/src/main/resources/mybatis/DDLMapper.xml:

@@ -0,0 +1,127 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
+
+CREATE TABLE dbs
+(
+db_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+desc VARCHAR(512) COMMENT 'database description',
+location_uri VARCHAR(512) COMMENT 'database storage path',
+name VARCHAR(512) UNIQUE COMMENT 'database name',
+owner_name VARCHAR(512) COMMENT 'database owner',
+owner_type VARCHAR(512) COMMENT 'database type',
+create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'db created time',
+update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time'
+) COMMENT 'databases';
+
+CREATE TABLE tbls
+(
+tbl_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+db_id BIGINT COMMENT 'database id',
+name VARCHAR(512) COMMENT 'table name',
+create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'table created time',
+update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
+owner_name VARCHAR(512) COMMENT 'table owner',
+location VARCHAR(512) COMMENT 'table location',
+UNIQUE KEY uniq_tb (db_id, name)
+) COMMENT 'tables';
+
+CREATE TABLE tbl_params
+(
+tbl_id BIGINT UNSIGNED COMMENT 'tbl id',
+param_key VARCHAR(256) COMMENT 'param_key',
+param_value VARCHAR(2048) COMMENT 'param_value',
+create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'parameter created time',
+update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
+PRIMARY KEY (tbl_id, param_key)
+) COMMENT 'tbl params';
+
+CREATE TABLE partitions
+(
+part_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+tbl_id BIGINT COMMENT 'table id',
+part_name VARCHAR(256) COMMENT 'partition path',
+create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT 'create time',
+update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
+is_deleted BOOL DEFAULT FALSE COMMENT 'whether the partition is deleted',
+UNIQUE uniq_partition_version (tbl_id, part_name)
+) COMMENT 'partitions';
+
+CREATE TABLE tbl_timestamp
+(
+tbl_id BIGINT UNSIGNED PRIMARY KEY COMMENT 'uuid',
+ts VARCHAR(17) COMMENT 'instant timestamp'
+) COMMENT 'generate the unique timestamp for a table';
+
+CREATE TABLE instant
+(
+instant_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+tbl_id BIGINT COMMENT 'table id',
+ts VARCHAR(17) COMMENT 'instant timestamp',
+action TINYINT COMMENT 'commit, deltacommit, compaction, replace etc',
+state TINYINT COMMENT 'completed, requested, inflight, invalid etc',
+duration INT DEFAULT 0 COMMENT 'for heartbeat (s)',
+start_ts INT DEFAULT 0 COMMENT 'for heartbeat (s)',
+UNIQUE KEY uniq_inst1 (tbl_id, state, ts, action),
+UNIQUE KEY uniq_inst2 (tbl_id, ts)
+) COMMENT 'timeline';
+
+CREATE TABLE instant_meta
+(
+commit_id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT COMMENT 'uuid',
+tbl_id BIGINT COMMENT 'table id',
+ts VARCHAR(17) COMMENT 'instant timestamp',
+action TINYINT COMMENT 'commit, deltacommit, compaction, replace etc',
+state TINYINT COMMENT 'completed, requested, inflight, invalid etc',
+data LONGBLOB COMMENT 'instant metadate',

Review Comment: typo 'metadate'->'metadata'

## hudi-metaserver/src/main/java/org/apache/hudi/common/table/HoodieTableMetaServerClient.java:

@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY
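The `ts VARCHAR(17)` columns in the DDL above hold Hudi instant timestamps. As a hedged sketch (assuming the `yyyyMMddHHmmssSSS` commit-time format Hudi writes to its timeline, which is exactly 17 characters; the helper name here is illustrative, not part of the PR):

```python
from datetime import datetime, timezone

def make_instant_ts(now=None):
    """Format a Hudi-style instant timestamp: yyyyMMddHHmmssSSS (17 chars)."""
    now = now or datetime.now(timezone.utc)
    # %f gives microseconds; keep only the first 3 digits for milliseconds
    return now.strftime("%Y%m%d%H%M%S") + now.strftime("%f")[:3]

# 14 date/time digits + 3 millisecond digits fit the ts VARCHAR(17) column
ts = make_instant_ts(datetime(2022, 9, 29, 12, 30, 45, 123000))
```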
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in lowercase. [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in lowercase. 
[https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > Issue : > Classname to use for non partitioned tables should be > {color:#0747a6}*NonpartitionedKeyGenerator* ( currently > *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in > lowercase. > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in lowercase. [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. 
*P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator) [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}*NonpartitionedKeyGenerator* ( currently > *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in > lowercase. > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator) [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *{color:#de350b}{color:#0747a6}NonPartitionedKeyGenerator{color}){color}* as per this repo. 
*P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}*NonpartitionedKeyGenerator* ( currently > *NonPartitionedKeyGenerator*{color}) as per this repo. *P* should be in > lowercase (Non{*}p{*}artitionedKeyGenerator) > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Priority: Major (was: Minor) > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo. > *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color}{color} > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo. *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color}{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for *[{color:#172b4d}Nonpartitionedkeygenerator{color}|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator]* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator as per this repo.{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] P should be in lowercase (Non{*}p{*}artitionedKeyGenerator) > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Minor > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > 
[https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo. > *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color}{color} > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}*NonpartitionedKeyGenerator* ( currently *{color:#de350b}{color:#0747a6}NonPartitionedKeyGenerator{color}){color}* as per this repo. *P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] was: Typo in Hudi documentation for - *NonPartitionedKeyGenerator* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo. 
*P* should be in lowercase (Non{*}p{*}artitionedKeyGenerator){color}{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Major > > Typo in Hudi documentation for - *NonPartitionedKeyGenerator* > > URL - > [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}*NonpartitionedKeyGenerator* ( currently > *{color:#de350b}{color:#0747a6}NonPartitionedKeyGenerator{color}){color}* as > per this repo. *P* should be in lowercase > (Non{*}p{*}artitionedKeyGenerator){color} > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on issue #6800: [SUPPORT]org.apache.avro.SchemaParseException: Illegal initial character: 1Min
nsivabalan commented on issue #6800: URL: https://github.com/apache/hudi/issues/6800#issuecomment-1263123231 if you are using deltastreamer, you can add a schema post processor and rename columns. if not, can't think of any easy solution apart from manually fixing it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
[ https://issues.apache.org/jira/browse/HUDI-4953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jayasheel Kalgal updated HUDI-4953: --- Description: Typo in Hudi documentation for *[{color:#172b4d}Nonpartitionedkeygenerator{color}|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator]* URL - [https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator as per this repo.{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] P should be in lowercase (Non{*}p{*}artitionedKeyGenerator) was: Typo in Hudi documentation for [Nonpartitionedkeygenerator|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] URL - [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo.{color}{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] P should be in lowercase ({color:#0747a6}Non{*}p{*}artitionedKeyGenerator){color} > Typo in Hudi documentation about NonPartitionedKeyGenerator > --- > > Key: HUDI-4953 > URL: https://issues.apache.org/jira/browse/HUDI-4953 > Project: Apache Hudi > Issue Type: Bug > Components: docs >Reporter: Jayasheel Kalgal >Priority: Minor > > Typo in Hudi documentation for > *[{color:#172b4d}Nonpartitionedkeygenerator{color}|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator]* > > URL - > 
[https://hudi.apache.org/docs/next/key_generation/#nonpartitionedkeygenerator] > > [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] > > > > Issue : > > Classname to use for non partitioned tables should be > {color:#0747a6}NonpartitionedKeyGenerator as per this repo.{color} > > [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] > > P should be in lowercase (Non{*}p{*}artitionedKeyGenerator) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on issue #6800: [SUPPORT]org.apache.avro.SchemaParseException: Illegal initial character: 1Min
nsivabalan commented on issue #6800: URL: https://github.com/apache/hudi/issues/6800#issuecomment-1263122615 we rely on avro's field naming conventions. looks like the starting character cannot be a number. https://issues.apache.org/jira/browse/AVRO-153
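For context, the Avro specification restricts record and field names to `[A-Za-z_][A-Za-z0-9_]*`, which is why a column named `1Min` fails to parse. A small hand-rolled sketch of that rule (not avro's own API; the `suggest_rename` strategy is just one possible fix):

```python
import re

# Avro name rule: first char is a letter or underscore, rest alphanumeric/underscore
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_avro_name(name: str) -> bool:
    return bool(AVRO_NAME.match(name))

def suggest_rename(name: str) -> str:
    """One possible fix: prefix a name that starts with a digit with an underscore."""
    return name if is_valid_avro_name(name) else "_" + name
```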
[GitHub] [hudi] nsivabalan commented on issue #6804: [SUPPORT] Repairing the hudi table from No such file or directory of parquet file.
nsivabalan commented on issue #6804: URL: https://github.com/apache/hudi/issues/6804#issuecomment-1263121682 if not for the metadata table, can't think of an easier way to go about this. essentially the cleaner has cleaned up some data file that is still required by the query. if you have very aggressive cleaner configs, you may try to relax them based on the max time any query can take for the table of interest.
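The advice above can be made concrete: the cleaner's retention window (roughly, commits retained times the average commit interval) should exceed the longest query you expect to run against the table. A hedged back-of-the-envelope check (the helpers and numbers are illustrative, not a Hudi API):

```python
def retention_window_secs(commits_retained: int, avg_commit_interval_secs: float) -> float:
    """Approximate how far back in time the retained commits reach."""
    return commits_retained * avg_commit_interval_secs

def cleaner_is_safe(commits_retained: int, avg_commit_interval_secs: float,
                    max_query_secs: float) -> bool:
    # Data files older than the retention window may be cleaned mid-query,
    # producing "No such file or directory" on the query side.
    return retention_window_secs(commits_retained, avg_commit_interval_secs) > max_query_secs

# e.g. 10 retained commits arriving every 5 minutes cover ~50 minutes of query time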
[GitHub] [hudi] nsivabalan commented on issue #6825: [SUPPORT]org.apache.hudi.exception.HoodieRemoteException: *****:37568 failed to respond
nsivabalan commented on issue #6825: URL: https://github.com/apache/hudi/issues/6825#issuecomment-1263120280 guess the timeline server crashed for some reason. CC @yihua, any thoughts?
[jira] [Created] (HUDI-4953) Typo in Hudi documentation about NonPartitionedKeyGenerator
Jayasheel Kalgal created HUDI-4953: -- Summary: Typo in Hudi documentation about NonPartitionedKeyGenerator Key: HUDI-4953 URL: https://issues.apache.org/jira/browse/HUDI-4953 Project: Apache Hudi Issue Type: Bug Components: docs Reporter: Jayasheel Kalgal Typo in Hudi documentation for [Nonpartitionedkeygenerator|https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] URL - [https://hudi.apache.org/blog/2021/02/13/hudi-key-generators/#nonpartitionedkeygenerator] Issue : Classname to use for non partitioned tables should be {color:#0747a6}NonpartitionedKeyGenerator {color:#172b4d}as per this repo.{color}{color} [https://github.com/apache/hudi/blob/15ca7a306058c5d8c708b5310cb92f213f8d5834/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/NonpartitionedKeyGenerator.java#L37] P should be in lowercase ({color:#0747a6}Non{*}p{*}artitionedKeyGenerator){color} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] nsivabalan commented on issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
nsivabalan commented on issue #6835: URL: https://github.com/apache/hudi/issues/6835#issuecomment-1263117459 since we have a patch being actively worked on, closing the issue. thanks for reporting.
[GitHub] [hudi] nsivabalan closed issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
nsivabalan closed issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm URL: https://github.com/apache/hudi/issues/6835
[GitHub] [hudi] nsivabalan commented on issue #5582: [SUPPORT] NullPointerException in merge into Spark Sql HoodieSparkSqlWriter$.mergeParamsAndGetHoodieConfig
nsivabalan commented on issue #5582: URL: https://github.com/apache/hudi/issues/5582#issuecomment-1263116299 @nitinkul @vicuna96 : gentle ping.
[GitHub] [hudi] nsivabalan commented on issue #6503: [SUPPORT] Hudi Merge Into with larger volume
nsivabalan commented on issue #6503: URL: https://github.com/apache/hudi/issues/6503#issuecomment-1263115964 my understanding is that preCombine is a mandatory field for the merge into statement. But I will let @alexeykudinkin investigate further.
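For context on why preCombine matters: when multiple incoming records share a record key, Hudi uses the preCombine field to decide which one wins. A hedged, pure-Python sketch of that semantics (not Hudi's actual implementation; the field names are illustrative):

```python
def precombine(records, key_field="id", precombine_field="ts"):
    """Keep, per key, the record with the greatest preCombine value."""
    latest = {}
    for rec in records:
        k = rec[key_field]
        if k not in latest or rec[precombine_field] > latest[k][precombine_field]:
            latest[k] = rec
    return list(latest.values())
```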
[GitHub] [hudi] nsivabalan commented on issue #5777: [SUPPORT] Hudi table has duplicate data.
nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1263114829 I see you have given test data. is everything to be ingested in one single commit, or using different commits? your reproducible script is not very clear on this.
[GitHub] [hudi] nsivabalan commented on issue #5777: [SUPPORT] Hudi table has duplicate data.
nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1263114477 @jiangjiguang : did not realize you had given us a reproducible code snippet. so from what you have given above, you could see duplicate data w/ the MOR RT query?
[GitHub] [hudi] nsivabalan commented on issue #5777: [SUPPORT] Hudi table has duplicate data.
nsivabalan commented on issue #5777: URL: https://github.com/apache/hudi/issues/5777#issuecomment-1263111919 sorry to have dropped the ball on this. again picking it up. btw, I see the config `hoodie.datasource.write.insert.drop.duplicates` was proposed earlier. do not set this to true. if set to true, records from the incoming batch that are already in storage will be dropped.
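To illustrate the effect of `hoodie.datasource.write.insert.drop.duplicates=true` described above, here is a hedged sketch (pure Python, not Hudi code; the helper and field names are illustrative): incoming records whose keys already exist in storage never reach the table.

```python
def apply_insert_drop_duplicates(incoming, existing_keys, key_field="id"):
    """Mimic insert.drop.duplicates=true: drop records whose key is already stored."""
    return [rec for rec in incoming if rec[key_field] not in existing_keys]

# With the flag on, a re-sent record is silently discarded,
# which looks like data loss if you expected an upsert of the new values.
```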
[GitHub] [hudi] jiangbiao910 commented on issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
jiangbiao910 commented on issue #6462: URL: https://github.com/apache/hudi/issues/6462#issuecomment-1263109912 @nsivabalan Thank you for your reply. if I don't set "hoodie.metadata.enable"="false", it throws "java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()". if I set "hoodie.metadata.enable"="false", it throws "Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata" often, but not every time. But when I run the sql again, it works well. I think HBase relies on Hadoop 2.10.0, but our environment is CDH-6.3.2 and the hadoop version is 3.0.
[jira] [Closed] (HUDI-4934) Cleaner cleans up files touched by clustering
[ https://issues.apache.org/jira/browse/HUDI-4934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-4934. - Resolution: Fixed > Cleaner cleans up files touched by clustering > - > > Key: HUDI-4934 > URL: https://issues.apache.org/jira/browse/HUDI-4934 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > I have some integration long running tests w/ cleaner and clustering. from > 21st or 22nd of sep, my tests have started to fail. > > Reason is, when clustering kicks in, it could not find the data files to be > clustered. Looks like cleaner has cleaned it up. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1263105795 ## CI report: * 77223f8b87bdfcfa75045fb622b127cc4f9e47ab Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11919) * 34427d0e522bec7eee731644080bd0b5d20570dc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11921) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6818: [HUDI-4948] Improve CDC Write
hudi-bot commented on PR #6818: URL: https://github.com/apache/hudi/pull/6818#issuecomment-1263105772 ## CI report: * f14363a4be66f8a05ddbbe14600176da151d04ff Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11843) * e0ccacd8d030984ed30f19b17b0dafb02d8685ee Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11920)
[GitHub] [hudi] hudi-bot commented on pull request #6793: [HUDI-4917] Optimized the way to get HoodieBaseFile of loadColumnRange…
hudi-bot commented on PR #6793: URL: https://github.com/apache/hudi/pull/6793#issuecomment-1263105718 ## CI report: * 32cc352122d276f5bb5943a0dd420920854fdb8e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11837) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11916)
[jira] [Updated] (HUDI-4948) Support flush and rollover for CDC Write
[ https://issues.apache.org/jira/browse/HUDI-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4948: - Labels: pull-request-available (was: ) > Support flush and rollover for CDC Write > > > Key: HUDI-4948 > URL: https://issues.apache.org/jira/browse/HUDI-4948 > Project: Apache Hudi > Issue Type: Improvement > Components: core, spark, writer-core >Reporter: Yann Byron >Priority: Major > Labels: pull-request-available >
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1263103453 ## CI report: * 77223f8b87bdfcfa75045fb622b127cc4f9e47ab Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11919) * 34427d0e522bec7eee731644080bd0b5d20570dc UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6818: [HUDI-4948] Improve CDC Write
hudi-bot commented on PR #6818: URL: https://github.com/apache/hudi/pull/6818#issuecomment-1263103404 ## CI report: * f14363a4be66f8a05ddbbe14600176da151d04ff Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11843) * e0ccacd8d030984ed30f19b17b0dafb02d8685ee UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6741: [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table
hudi-bot commented on PR #6741: URL: https://github.com/apache/hudi/pull/6741#issuecomment-1263100954 ## CI report: * bff3acafde6d8a1bd5574b90ce644ef30acbf0a2 UNKNOWN * e39d50d6242e272f867c9987a8a2e97ca323568f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11886) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11915)
[GitHub] [hudi] nsivabalan commented on issue #6101: [SUPPORT] Hudi Delete Not working with EMR, AWS Glue & S3
nsivabalan commented on issue #6101: URL: https://github.com/apache/hudi/issues/6101#issuecomment-1263071072 @navbalaraman : hey any updates for us. if you could not reproduce, feel free to close it out.
[GitHub] [hudi] nsivabalan commented on issue #6504: [SUPPORT] Hudi deletes fail in HoodieDeltaStreamer
nsivabalan commented on issue #6504: URL: https://github.com/apache/hudi/issues/6504#issuecomment-1263070852 @santoshraj123 : gentle ping. if you got the issue resolved, feel free to close it out.
[GitHub] [hudi] nsivabalan commented on issue #6428: [SUPPORT] S3 Deltastreamer: Block has already been inflated
nsivabalan commented on issue #6428: URL: https://github.com/apache/hudi/issues/6428#issuecomment-1263070600 Since we could not reproduce w/ OSS spark, can you reach out to aws support. CC @umehrot2 @rahil-c : Have you folks seen this issue before. seems like simple read from metadata table is failing w/ EMR spark.
[GitHub] [hudi] nsivabalan commented on issue #6428: [SUPPORT] S3 Deltastreamer: Block has already been inflated
nsivabalan commented on issue #6428: URL: https://github.com/apache/hudi/issues/6428#issuecomment-1263069837 yes, you are right. you can disable via hudi-cli as well.
[GitHub] [hudi] nsivabalan commented on issue #6421: [SUPPORT]Table property not working while creating table - hoodie.datasource.write.drop.partition.columns
nsivabalan commented on issue #6421: URL: https://github.com/apache/hudi/issues/6421#issuecomment-1263069591 @sandip-yadav : gentle ping. did you get a chance to try 0.12.
[GitHub] [hudi] wwli05 commented on issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
wwli05 commented on issue #6835: URL: https://github.com/apache/hudi/issues/6835#issuecomment-1263069076 thank you, friends, really JI_SHI_YU ("timely rain", i.e. timely help)
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1263068964 ## CI report: * 77223f8b87bdfcfa75045fb622b127cc4f9e47ab Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11919)
[GitHub] [hudi] nsivabalan closed issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
nsivabalan closed issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata URL: https://github.com/apache/hudi/issues/6462
[GitHub] [hudi] nsivabalan commented on issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
nsivabalan commented on issue #6462: URL: https://github.com/apache/hudi/issues/6462#issuecomment-1263068927 closing github issue as we have a fix. thanks for reporting.
[GitHub] [hudi] hudi-bot commented on pull request #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
hudi-bot commented on PR #6836: URL: https://github.com/apache/hudi/pull/6836#issuecomment-1263066749 ## CI report: * 77223f8b87bdfcfa75045fb622b127cc4f9e47ab UNKNOWN
[GitHub] [hudi] nsivabalan commented on issue #6462: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata
nsivabalan commented on issue #6462: URL: https://github.com/apache/hudi/issues/6462#issuecomment-1263064400 https://github.com/apache/hudi/pull/6836
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1263063979 ## CI report: * 288d166c49602a4593b1e97763a467811903737d UNKNOWN * ae59f6f918a5a08535b73be5c3fc2f29f5e84fb9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11879) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11913)
[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4952: - Labels: pull-request-available (was: ) > Reading from metadata table could fail when there are no completed commits > -- > > Key: HUDI-4952 > URL: https://issues.apache.org/jira/browse/HUDI-4952 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > When metadata table is just getting initialized, but first commit is not yet > fully complete, reading from metadata table could fail w/ below stacktrace. > > {code:java} > 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job > 39d720db-b15d-4823-b8b1-54398b143d6e > org.apache.hudi.exception.HoodieException: Error fetching partition paths > from metadata table > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) > at > org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176) > at > org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219) > at > org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264) > at > org.apache.hudi.BaseHoodieTableFileIndex.<init>(BaseHoodieTableFileIndex.java:139) > at > org.apache.hudi.hadoop.HiveHoodieTableFileIndex.<init>(HiveHoodieTableFileIndex.java:49) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141) > at > org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889) 
> at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217) > at > org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getSplits(HoodieCombineHiveInputFormat.java:942) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getCombineSplits(HoodieCombineHiveInputFormat.java:241) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getSplits(HoodieCombineHiveInputFormat.java:363) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) > at org.apache.spark.rdd.RDD.getNumPartitions(RDD.scala:267) > at > org.apache.spark.api.java.JavaRDDLike$class.getNumPartitions(JavaRDDLike.scala:65) > at > org.apache.spark.api.java.AbstractJavaRDDLike.getNumPartitions(JavaRDDLike.scala:45) > at > org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateMapInput(SparkPlanGenerator.java:252) > at > org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:179) > at > org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:130) > at > org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:355) > at > org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:400) > at > org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:365) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to > retrieve list of partition from metadata > at > org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:113) > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:313) > ... 32 more > Caused by: java.util.NoSuchElementException: No value present in Option > at org.apache.hudi.common.util.Option.get(Option.java:89) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.getPartitionFileSlices(HoodieTableMetadataUtil.java:1057) > at >
[GitHub] [hudi] nsivabalan opened a new pull request, #6836: [HUDI-4952] Fixing reading from metadata table when there are no inflight commits
nsivabalan opened a new pull request, #6836: URL: https://github.com/apache/hudi/pull/6836 ### Change Logs When metadata table is just getting initialized, but first commit is not yet fully complete, reading from metadata table could fail w/ below stacktrace. Call trace that could result in this.
```
BaseHoodieTableFileIndex.doRefresh() // metadataConfig will have metadata enabled if user enables for the query session. lets assume user enabled while the metadata table is being built out.
{
    HoodieTableMetadata newTableMetadata = HoodieTableMetadata.create(engineContext, metadataConfig, );
    // HoodieTableMetadata.create eventually will call constructor of HoodieBackedTableMetadata(), within which we call initIfNeeded().
    // within initIfNeeded { we disable metadata only if table itself is not found. if not, metadata is still enabled. }
    -> loadPartitionPathFiles
}
loadPartitionPathFiles {
    ...
    getAllFilesInPartitionsUnchecked()
}
getAllFilesInPartitionsUnchecked {
    tableMetadata.getAllFilesInPartitions(list of interested partitions)
}
getAllFilesInPartitions {
    BaseTableMetadata.fetchAllFilesInPartitionPaths...
}
BaseTableMetadata.fetchAllFilesInPartitionPaths {
    ..
    getRecordsByKeys(...)
}
HoodieBackedTableMetadata.getRecordsByKeys {
    getPartitionFileSliceToKeysMapping()
}
getPartitionFileSliceToKeysMapping {
    List<FileSlice> latestFileSlices = HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, partitionName);
}
HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices {
    HoodieTableFileSystemView fsView = fileSystemView.orElse(getFileSystemView(metaClient));
    Stream<FileSlice> fileSliceStream;
    if (mergeFileSlices) { // this is true for this call graph.
        if (metaClient.getActiveTimeline().filterCompletedInstants().lastInstant().isPresent()) {
            fileSliceStream = fsView.getLatestMergedFileSlicesBeforeOrOn(
                partition, metaClient.getActiveTimeline().filterCompletedInstants().lastInstant().get().getTimestamp());
        }
    }
}
```
There is no lastInstant as the Metadata table is still being initialized.

### Impact
_Describe any public API or user-facing feature change or any performance impact._
**Risk level: none | low | medium | high**
_Choose one. If medium or high, explain what verification was done to mitigate the risks._

### Documentation Update
_Describe any necessary documentation update if there is any new feature, config, or user-facing change_
- _The config description must be updated if new configs are added or the default value of the configs are changed_
- _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._

### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
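The crux of the bug is an unguarded `lastInstant().get()`: while the metadata table is still initializing there is no completed instant, and Hudi's `Option.get()` throws `java.util.NoSuchElementException: No value present`. A minimal standalone sketch of the failure mode and the guard, using `java.util.Optional` as a stand-in for `org.apache.hudi.common.util.Option` (the class and helper names below are illustrative, not Hudi APIs):

```java
import java.util.Optional;

public class LastInstantGuard {

    // Stand-in for metaClient.getActiveTimeline().filterCompletedInstants().lastInstant():
    // empty while the metadata table has no completed commit yet.
    static Optional<String> lastCompletedInstant(boolean hasCompletedCommit) {
        return hasCompletedCommit ? Optional.of("20220820025658") : Optional.empty();
    }

    // Buggy pattern: get() on an empty Optional throws NoSuchElementException,
    // matching the "No value present" error in the stack trace above.
    static String unguarded(boolean hasCompletedCommit) {
        return lastCompletedInstant(hasCompletedCommit).get();
    }

    // Guarded pattern: check isPresent() first and fall back gracefully.
    static String guarded(boolean hasCompletedCommit) {
        Optional<String> last = lastCompletedInstant(hasCompletedCommit);
        return last.isPresent() ? last.get() : "NO_COMPLETED_INSTANT";
    }

    public static void main(String[] args) {
        System.out.println(guarded(true));
        System.out.println(guarded(false));
        try {
            unguarded(false);
        } catch (java.util.NoSuchElementException e) {
            // Reached only when no commit has completed yet.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The same shape of guard is what the call trace implies the fix needs: only query file slices `BeforeOrOn` a timestamp when a completed instant actually exists.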
[GitHub] [hudi] xiarixiaoyao commented on issue #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
xiarixiaoyao commented on issue #6835: URL: https://github.com/apache/hudi/issues/6835#issuecomment-1263059421 https://github.com/apache/hudi/pull/6741
[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4952: -- Sprint: 2022/09/19
[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4952: -- Priority: Blocker (was: Major)
[jira] [Assigned] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-4952: - Assignee: sivabalan narayanan > Reading from metadata table could fail when there are no completed commits > -- > > Key: HUDI-4952 > URL: https://issues.apache.org/jira/browse/HUDI-4952 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > > When metadata table is just getting initialized, but first commit is not yet > fully complete, reading from metadata table could fail w/ below stacktrace. > > {code:java} > 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job > 39d720db-b15d-4823-b8b1-54398b143d6e > org.apache.hudi.exception.HoodieException: Error fetching partition paths > from metadata table > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) > at > org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176) > at > org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219) > at > org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264) > at > org.apache.hudi.BaseHoodieTableFileIndex.(BaseHoodieTableFileIndex.java:139) > at > org.apache.hudi.hadoop.HiveHoodieTableFileIndex.(HiveHoodieTableFileIndex.java:49) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141) > at > org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889) > at > 
> ... {code}
[jira] [Updated] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
[ https://issues.apache.org/jira/browse/HUDI-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-4952: -- Fix Version/s: 0.12.1 > Reading from metadata table could fail when there are no completed commits > -- > > Key: HUDI-4952 > URL: https://issues.apache.org/jira/browse/HUDI-4952 > Project: Apache Hudi > Issue Type: Bug > Components: metadata >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Fix For: 0.12.1 > > > When metadata table is just getting initialized, but first commit is not yet > fully complete, reading from metadata table could fail w/ below stacktrace. > > {code:java} > 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job > 39d720db-b15d-4823-b8b1-54398b143d6e > org.apache.hudi.exception.HoodieException: Error fetching partition paths > from metadata table > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) > at > org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176) > at > org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219) > at > org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264) > at > org.apache.hudi.BaseHoodieTableFileIndex.(BaseHoodieTableFileIndex.java:139) > at > org.apache.hudi.hadoop.HiveHoodieTableFileIndex.(HiveHoodieTableFileIndex.java:49) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234) > at > org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141) > at > org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90) > at > org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889) > at > 
> ... {code}
[jira] [Created] (HUDI-4952) Reading from metadata table could fail when there are no completed commits
sivabalan narayanan created HUDI-4952: - Summary: Reading from metadata table could fail when there are no completed commits Key: HUDI-4952 URL: https://issues.apache.org/jira/browse/HUDI-4952 Project: Apache Hudi Issue Type: Bug Components: metadata Reporter: sivabalan narayanan When metadata table is just getting initialized, but first commit is not yet fully complete, reading from metadata table could fail w/ below stacktrace. {code:java} 22/08/20 02:56:58 ERROR client.RemoteDriver: Failed to run client job 39d720db-b15d-4823-b8b1-54398b143d6e org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:315) at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:176) at org.apache.hudi.BaseHoodieTableFileIndex.loadPartitionPathFiles(BaseHoodieTableFileIndex.java:219) at org.apache.hudi.BaseHoodieTableFileIndex.doRefresh(BaseHoodieTableFileIndex.java:264) at org.apache.hudi.BaseHoodieTableFileIndex.<init>(BaseHoodieTableFileIndex.java:139) at org.apache.hudi.hadoop.HiveHoodieTableFileIndex.<init>(HiveHoodieTableFileIndex.java:49) at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatusForSnapshotMode(HoodieCopyOnWriteTableInputFormat.java:234) at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.listStatus(HoodieCopyOnWriteTableInputFormat.java:141) at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.listStatus(HoodieParquetInputFormatBase.java:90) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.listStatus(HoodieCombineHiveInputFormat.java:889) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217) at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76) at 
org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat$HoodieCombineFileInputFormatShim.getSplits(HoodieCombineHiveInputFormat.java:942) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getCombineSplits(HoodieCombineHiveInputFormat.java:241) at org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat.getSplits(HoodieCombineHiveInputFormat.java:363) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) at org.apache.spark.rdd.RDD.getNumPartitions(RDD.scala:267) at org.apache.spark.api.java.JavaRDDLike$class.getNumPartitions(JavaRDDLike.scala:65) at org.apache.spark.api.java.AbstractJavaRDDLike.getNumPartitions(JavaRDDLike.scala:45) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateMapInput(SparkPlanGenerator.java:252) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:179) at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:130) at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:355) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:400) at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:365) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of partition from metadata at org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:113) at 
org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:313) ... 32 more Caused by: java.util.NoSuchElementException: No value present in Option at org.apache.hudi.common.util.Option.get(Option.java:89) at org.apache.hudi.metadata.HoodieTableMetadataUtil.getPartitionFileSlices(HoodieTableMetadataUtil.java:1057) at org.apache.hudi.metadata.HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(HoodieTableMetadataUtil.java:1001) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getPartitionFileSliceToKeysMapping(HoodieBackedTableMetadata.java:377) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:204) at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:140) at org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:281) at
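The root cause in the trace above is an unconditional `Option.get()` in `HoodieTableMetadataUtil.getPartitionFileSlices`, which throws `NoSuchElementException` when the metadata table has no completed commit yet. A minimal, self-contained sketch of that failure mode, using `java.util.Optional` as a stand-in for Hudi's `org.apache.hudi.common.util.Option` (the `latestCompletedInstant` helper below is hypothetical, not Hudi code):

```java
import java.util.List;
import java.util.Optional;

// Sketch of the failure mode: when no commit has completed on the metadata
// table yet, the "latest instant" lookup is empty, and an unconditional
// get() throws NoSuchElementException, surfacing as the stack trace above.
class MetadataInstantLookup {

    // Hypothetical helper: latest completed instant time, if any.
    static Optional<String> latestCompletedInstant(List<String> completed) {
        return completed.isEmpty() ? Optional.empty()
                                   : Optional.of(completed.get(completed.size() - 1));
    }

    // Unsafe: mirrors the failing call site (get() on a possibly empty Option).
    static String unsafeLatest(List<String> completed) {
        return latestCompletedInstant(completed).get(); // throws if nothing completed yet
    }

    // Safe: fall back instead of throwing when no commit has completed yet.
    static String safeLatest(List<String> completed) {
        return latestCompletedInstant(completed).orElse("NONE");
    }
}
```

The fix direction implied by the report is to guard the empty case (or treat an uninitialized metadata table as "no partitions") rather than calling `get()` unconditionally.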
[GitHub] [hudi] wwli05 opened a new issue, #6835: [SUPPORT] hive doesnt support mor read now, pls confirm
wwli05 opened a new issue, #6835: URL: https://github.com/apache/hudi/issues/6835

HoodieRealtimeRecordReader claims to support merge-on-read record reading, but in my test it only returns data from the log file. Looking at RealtimeCompactedRecordReader:

    public boolean next(NullWritable aVoid, ArrayWritable arrayWritable) throws IOException {
      while (this.parquetReader.next(aVoid, arrayWritable)) {
        if (!deltaRecordMap.isEmpty()) {
          String key = arrayWritable.get()[recordKeyIndex].toString();
          if (deltaRecordMap.containsKey(key)) {
            this.deltaRecordKeys.remove(key);
            // 1. this call only reads the record from the log file
            Option rec = buildGenericRecordwithCustomPayload(deltaRecordMap.get(key));
            if (!rec.isPresent()) {
              continue;
            }
            // 2. this call only copies; there is no merge logic
            setUpWritable(rec, arrayWritable, key);
            return true;
          }
        }
        return true;
      }
      return false;
    }

3. So I think Hive does not support merge-on-read record reading now; can someone confirm this?
4. To support MOR reads, shouldn't buildGenericRecordwithCustomPayload pass the current value from the parquet file and invoke combineAndGetUpdateValue instead of getInsertValue? Am I right?

The current buildGenericRecordwithCustomPayload logic:

    private Option buildGenericRecordwithCustomPayload(HoodieRecord record) throws IOException {
      if (usesCustomPayload) {
        return ((HoodieAvroRecord) record).getData().getInsertValue(getWriterSchema(), payloadProps);
      } else {
        return ((HoodieAvroRecord) record).getData().getInsertValue(getReaderSchema(), payloadProps);
      }
    }

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
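The distinction the reporter is asking about can be sketched with a toy payload type. This is not Hudi's real `HoodieRecordPayload` interface (the real one operates on Avro records and schemas); it is a simplified stand-in, with illustrative names and a plain `String` record type, that only shows why calling `getInsertValue` alone ignores the base-file row while `combineAndGetUpdateValue` can merge with it:

```java
import java.util.Optional;

// Simplified stand-in for Hudi's HoodieRecordPayload semantics:
// getInsertValue sees only the log record; combineAndGetUpdateValue
// also receives the existing base-file row and can merge the two.
interface Payload {
    Optional<String> getInsertValue();                     // log record only
    Optional<String> combineAndGetUpdateValue(String old); // merge with base-file row
}

// Toy payload: a null value models a field the log record did not set.
class OverwriteNonNullPayload implements Payload {
    private final String value;

    OverwriteNonNullPayload(String value) { this.value = value; }

    @Override public Optional<String> getInsertValue() {
        return Optional.ofNullable(value);
    }

    @Override public Optional<String> combineAndGetUpdateValue(String old) {
        // Keep the old row's value when the incoming record carries none.
        return Optional.of(value != null ? value : old);
    }
}
```

With `getInsertValue` alone, the base-file value never participates in the result, which matches the behavior the reporter observed in `RealtimeCompactedRecordReader`.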
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6705: [HUDI-4868] Fixed the issue that compaction is invalid when the last commit action is replace commit.
nsivabalan commented on code in PR #6705: URL: https://github.com/apache/hudi/pull/6705#discussion_r984172615

## hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java:

    @@ -214,22 +216,22 @@ public static List getPendingCompactionInstantTimes(HoodieTableMe
     */
    public static Option> getDeltaCommitsSinceLatestCompaction(
        HoodieActiveTimeline activeTimeline) {
    -  Option lastCompaction = activeTimeline.getCommitTimeline()
    +  Option lastCompaction = activeTimeline.getCommitTimeline().filter(s -> !s.getAction().equals(REPLACE_COMMIT_ACTION))
           .filterCompletedInstants().lastInstant();
    -  HoodieTimeline deltaCommits = activeTimeline.getDeltaCommitTimeline();
    +  HoodieTimeline deltaAndReplaceCommits = activeTimeline.getDeltaCommitAndReplaceCommitTimeline();

Review Comment (on the `getDeltaCommitAndReplaceCommitTimeline` change): I am not sure this makes sense. This method `getDeltaCommitsSinceLatestCompaction` only cares about delta commits for the purpose of scheduling compaction, so replace commits should not matter. Can you help me understand why we need to include replace commits here?

Review Comment (on the `REPLACE_COMMIT_ACTION` filter): I agree this fix makes sense.
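The part of the fix the reviewer agrees with can be sketched with a toy timeline in place of `HoodieActiveTimeline`. Clustering puts a `replacecommit` on the commit timeline, so without the added filter the last-compaction lookup can land on a replace commit and undercount the delta commits since the real compaction. The sketch models only the filtering logic, not Hudi's real timeline API:

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Toy model of the timeline filtering under discussion. Instant times are
// lexicographically ordered strings, as in Hudi's timeline.
class TimelineSketch {
    record Instant(String time, String action) {}

    // Last completed compaction: only "commit" actions count; a
    // "replacecommit" (clustering) must not be mistaken for a compaction.
    static Optional<Instant> lastCompaction(List<Instant> timeline) {
        return timeline.stream()
                .filter(i -> i.action().equals("commit")) // excludes "replacecommit"
                .reduce((a, b) -> b);                     // keep the last one
    }

    // Delta commits written after the last compaction.
    static List<Instant> deltaCommitsSince(List<Instant> timeline) {
        String since = lastCompaction(timeline).map(Instant::time).orElse("");
        return timeline.stream()
                .filter(i -> i.action().equals("deltacommit"))
                .filter(i -> i.time().compareTo(since) > 0)
                .collect(Collectors.toList());
    }
}
```

Without the `"commit"`-only filter, a trailing `replacecommit` would become the "last compaction" and the delta commits before it would be silently dropped from the count used to schedule compaction.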
[GitHub] [hudi] hudi-bot commented on pull request #6805: [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row
hudi-bot commented on PR #6805: URL: https://github.com/apache/hudi/pull/6805#issuecomment-1263034687 ## CI report: * 573c27aef34708f1b6019f0647a0ef7093c3a96a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11822) * 075b993b608134f15eff7cab96b8e916369ae722 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11918) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6805: [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row
hudi-bot commented on PR #6805: URL: https://github.com/apache/hudi/pull/6805#issuecomment-1263032260 ## CI report: * 573c27aef34708f1b6019f0647a0ef7093c3a96a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11822) * 075b993b608134f15eff7cab96b8e916369ae722 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6830: [HUDI-2118] Skip checking corrupt log blocks for transactional write file systems
hudi-bot commented on PR #6830: URL: https://github.com/apache/hudi/pull/6830#issuecomment-1263026890 ## CI report: * 6ab358154bb350a68340c9e8b9cafd0de260252c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11897) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11917)
[GitHub] [hudi] hudi-bot commented on pull request #6751: [MINOR] Fixes to make unit tests work on m1
hudi-bot commented on PR #6751: URL: https://github.com/apache/hudi/pull/6751#issuecomment-1263026731 ## CI report: * c7a1d373796e8bfce040bd79a07f68ef6b7ffc59 UNKNOWN * 287c52c6da5eb75093f3c9f7bfd5bfaf0eeb9ac0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11911)
[GitHub] [hudi] hudi-bot commented on pull request #6793: 【HUDI-4917】Optimized the way to get HoodieBaseFile of loadColumnRange…
hudi-bot commented on PR #6793: URL: https://github.com/apache/hudi/pull/6793#issuecomment-1263026777 ## CI report: * 32cc352122d276f5bb5943a0dd420920854fdb8e Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11837) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11916)
[GitHub] [hudi] boneanxs commented on a diff in pull request #6793: 【HUDI-4917】Optimized the way to get HoodieBaseFile of loadColumnRange…
boneanxs commented on code in PR #6793: URL: https://github.com/apache/hudi/pull/6793#discussion_r984153227

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java:

    @@ -161,19 +162,19 @@ private List> getBloomIndexFileInfoForPartition
    List> loadColumnRangesFromFiles(
        List partitions, final HoodieEngineContext context, final HoodieTable hoodieTable) {
      // Obtain the latest data files from all the partitions.
    - List> partitionPathFileIDList = getLatestBaseFilesForAllPartitions(partitions, context, hoodieTable).stream()
    -     .map(pair -> Pair.of(pair.getKey(), pair.getValue().getFileId()))
    + List>> partitionPathFileIDList = getLatestBaseFilesForAllPartitions(partitions, context, hoodieTable).stream()
    +     .map(pair -> Pair.of(pair.getKey(), Pair.of(pair.getValue().getFileId(), pair.getValue(
          .collect(toList());
      context.setJobStatus(this.getClass().getName(), "Obtain key ranges for file slices (range pruning=on): " + config.getTableName());
      return context.map(partitionPathFileIDList, pf -> {
        try {
    -     HoodieRangeInfoHandle rangeInfoHandle = new HoodieRangeInfoHandle(config, hoodieTable, pf);
    -     String[] minMaxKeys = rangeInfoHandle.getMinMaxKeys();
    -     return Pair.of(pf.getKey(), new BloomIndexFileInfo(pf.getValue(), minMaxKeys[0], minMaxKeys[1]));
    +     HoodieRangeInfoHandle rangeInfoHandle = new HoodieRangeInfoHandle(config, hoodieTable, Pair.of(pf.getKey(), pf.getValue().getKey()));
    +     String[] minMaxKeys = rangeInfoHandle.getMinMaxKeys(pf.getValue().getValue());

Review Comment: I think `HoodieRangeInfoHandle` is bound to a file slice, but here you break the class's meaning by allowing it to handle different files. Maybe we can change the constructor to accept a `BaseFile`, while keeping the method as it was before.
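The reviewer's suggestion (bind the file in the constructor and keep `getMinMaxKeys()` parameterless) can be sketched with stand-in types. `BaseFileStub` and `RangeInfoHandle` below are illustrative and do not match Hudi's real `HoodieBaseFile`/`HoodieRangeInfoHandle` signatures:

```java
// Stand-in for a base file whose key range is known (in Hudi, the min/max
// record keys would come from the parquet footer, not fields like these).
class BaseFileStub {
    final String fileId;
    final String minKey, maxKey;
    BaseFileStub(String fileId, String minKey, String maxKey) {
        this.fileId = fileId; this.minKey = minKey; this.maxKey = maxKey;
    }
}

// The handle is bound to exactly one file at construction time, so callers
// cannot accidentally query it against a different file per method call.
class RangeInfoHandle {
    private final BaseFileStub baseFile;

    RangeInfoHandle(BaseFileStub baseFile) { this.baseFile = baseFile; }

    // No file parameter: the handle already knows which file it reads.
    String[] getMinMaxKeys() {
        return new String[] { baseFile.minKey, baseFile.maxKey };
    }
}
```

Binding the file once preserves the original class contract (one handle per file slice) while still avoiding the extra file lookup the PR set out to optimize.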
[GitHub] [hudi] hudi-bot commented on pull request #6745: Fix comment in RFC46
hudi-bot commented on PR #6745: URL: https://github.com/apache/hudi/pull/6745#issuecomment-1263026678 ## CI report: * f2823f9cfd431f63e8026cd4a4d4680cd842a660 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11910)
[GitHub] [hudi] hudi-bot commented on pull request #6815: [HUDI-4937] Fix `HoodieTable` injecting non-reusable `HoodieBackedTableMetadata` aggressively flushing MT readers
hudi-bot commented on PR #6815: URL: https://github.com/apache/hudi/pull/6815#issuecomment-1263026828 ## CI report: * 12160b8c178ef5bd2721727207c41fdfa2f40e8f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11883) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11912)
[GitHub] [hudi] boneanxs commented on pull request #6793: 【HUDI-4917】Optimized the way to get HoodieBaseFile of loadColumnRange…
boneanxs commented on PR #6793: URL: https://github.com/apache/hudi/pull/6793#issuecomment-1263026464 @hudi-bot run azure
[GitHub] [hudi] giftbowen commented on pull request #6830: [HUDI-2118] Skip checking corrupt log blocks for transactional write file systems
giftbowen commented on PR #6830: URL: https://github.com/apache/hudi/pull/6830#issuecomment-1263021647 @hudi-bot run azure
[GitHub] [hudi] scxwhite commented on issue #6687: [SUPPORT] Poor Upsert Performance on COW table due to indexing
scxwhite commented on issue #6687: URL: https://github.com/apache/hudi/issues/6687#issuecomment-1263019514 You can see how to use these indexes in the [official documents.](https://hudi.apache.org/docs/basic_configurations#index-configs) If you want to know more about bucket index, take a look at this [document](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index).
[GitHub] [hudi] yuzhaojing closed pull request #6823: [Do Not Merge] test for 0.12.1 rc1
yuzhaojing closed pull request #6823: [Do Not Merge] test for 0.12.1 rc1 URL: https://github.com/apache/hudi/pull/6823
[GitHub] [hudi] boneanxs commented on pull request #6725: [HUDI-4881] Push down filters if possible when syncing partitions to Hive
boneanxs commented on PR #6725: URL: https://github.com/apache/hudi/pull/6725#issuecomment-1263013312 @codope @yihua @alexeykudinkin @xushiyan Hi, could you please take a look at this improvement?
[jira] [Closed] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"
[ https://issues.apache.org/jira/browse/HUDI-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin closed HUDI-4879. - Resolution: Fixed > MERGE INTO fails when setting "hoodie.datasource.write.payload.class" > - > > Key: HUDI-4879 > URL: https://issues.apache.org/jira/browse/HUDI-4879 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Jian Feng >Priority: Blocker > Fix For: 0.12.1 > > > As reported by the user: > [https://github.com/apache/hudi/issues/6354] > > Currently, setting {{hoodie.datasource.write.payload.class = > 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'}} will result in > the following exception: > {code:java} > org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0 at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329) > at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244) > at > org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102) > at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386) > at > org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498) > at > 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:335) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge > new record with old value in storage, for new record > {HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, > currentLocation='HoodieRecordLocation {instantTime=20220810095846644, > fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', > newLocation='HoodieRecordLocation {instantTime=20220810101719437, > fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}}, old value > {{"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": > "20220810095824514_0_0", "_hoodie_record_key": "id:1", > "_hoodie_partition_path": "", "_hoodie_file_name": > 
"60c04f95-ca5e-4f82-9558-40da29cc022e-0_0-937-24808_20220810095846644.parquet", > "id": 1, "name": "a0", "ts": 1000}} at > org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149) > at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358) > at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349) > at > org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322) > ... 28 more > Caused by: org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new >
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5416: [HUDI-3963] Use Lock-Free Message Queue Disruptor Improving Hoodie Writing Efficiency
alexeykudinkin commented on code in PR #5416: URL: https://github.com/apache/hudi/pull/5416#discussion_r984089070 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -230,6 +240,16 @@ public class HoodieWriteConfig extends HoodieConfig { .defaultValue(String.valueOf(4 * 1024 * 1024)) .withDocumentation("Size of in-memory buffer used for parallelizing network reads and lake storage writes."); + public static final ConfigProperty WRITE_BUFFER_SIZE = ConfigProperty + .key("hoodie.write.buffer.size") + .defaultValue(1024) + .withDocumentation("The size of the Disruptor Executor ring buffer, must be power of 2"); + + public static final ConfigProperty WRITE_WAIT_STRATEGY = ConfigProperty Review Comment: Same as above ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/HoodieExecutor.java: ## @@ -0,0 +1,36 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.util.queue; + +import java.util.concurrent.ExecutorCompletionService; + +public abstract class HoodieExecutor { Review Comment: - Let's convert this to an interface - Please add a java-doc ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/HoodieDaemonThreadFactory.java: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.util.queue; + +import org.jetbrains.annotations.NotNull; +import java.util.concurrent.ThreadFactory; + +public class HoodieDaemonThreadFactory implements ThreadFactory { + + private Runnable preExecuteRunnable; + + public HoodieDaemonThreadFactory(Runnable preExecuteRunnable) { +this.preExecuteRunnable = preExecuteRunnable; Review Comment: Can you help me understand what kind of prologues we're planning to execute here? ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/HoodieDaemonThreadFactory.java: ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.util.queue; + +import org.jetbrains.annotations.NotNull; +import java.util.concurrent.ThreadFactory; + +public class HoodieDaemonThreadFactory implements ThreadFactory { + + private Runnable preExecuteRunnable; + + public HoodieDaemonThreadFactory(Runnable preExecuteRunnable) { Review Comment: If we're planning to have a custom factory it's a good idea to add custom name to the threads it produces (for them to be more easily identifiable) ## hudi-common/src/main/java/org/apache/hudi/common/util/queue/DisruptorMessageQueue.java: ## @@
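The reviewer's suggestions above — give the custom `ThreadFactory`'s threads an identifiable name, and note that `hoodie.write.buffer.size` must be a power of two — could look roughly like the sketch below. This is illustrative only: the class and method names are not Hudi's actual implementation, and the prologue semantics are an assumption based on the review thread.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (not Hudi's actual code): a daemon ThreadFactory that
// names its threads with a recognizable prefix and runs a prologue
// (e.g. thread-local setup) on the new thread before each task.
class NamedDaemonThreadFactory implements ThreadFactory {

  private final String namePrefix;
  private final Runnable preExecuteRunnable;
  private final AtomicInteger threadNumber = new AtomicInteger(0);

  NamedDaemonThreadFactory(String namePrefix, Runnable preExecuteRunnable) {
    this.namePrefix = namePrefix;
    this.preExecuteRunnable = preExecuteRunnable;
  }

  @Override
  public Thread newThread(Runnable r) {
    Runnable wrapped = () -> {
      preExecuteRunnable.run(); // prologue runs on the new thread, before the task
      r.run();
    };
    Thread t = new Thread(wrapped, namePrefix + "-" + threadNumber.incrementAndGet());
    t.setDaemon(true); // daemon threads will not block JVM shutdown
    return t;
  }

  // A Disruptor ring-buffer size must be a positive power of two,
  // which is what the WRITE_BUFFER_SIZE documentation above requires.
  static boolean isPowerOfTwo(int n) {
    return n > 0 && (n & (n - 1)) == 0;
  }
}
```

Naming the threads (e.g. `hoodie-disruptor-1`, `hoodie-disruptor-2`) makes them easy to pick out in thread dumps, which is exactly the identifiability concern the review raises.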
[GitHub] [hudi] hudi-bot commented on pull request #6741: [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table
hudi-bot commented on PR #6741: URL: https://github.com/apache/hudi/pull/6741#issuecomment-1262994744 ## CI report: * bff3acafde6d8a1bd5574b90ce644ef30acbf0a2 UNKNOWN * e39d50d6242e272f867c9987a8a2e97ca323568f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11886) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11915) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on pull request #6741: [HUDI-4898] presto/hive respect payload during merge parquet file and logfile when reading mor table
xiarixiaoyao commented on PR #6741: URL: https://github.com/apache/hudi/pull/6741#issuecomment-1262994330 @hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #6831: [DO NOT MERGE] doing a test
hudi-bot commented on PR #6831: URL: https://github.com/apache/hudi/pull/6831#issuecomment-1262989566 ## CI report: * abde5c46b45518257866a3de7914352920c8c5cf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11909)
[GitHub] [hudi] zhengyuan-cn commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn commented on issue #6596: URL: https://github.com/apache/hudi/issues/6596#issuecomment-1262986108 > > I replaced the Impala Hudi dependency jars (hudi-common-0.5.0-incubating.jar, hudi-hadoop-mr-0.5.0-incubating.jar) with (hudi-common-0.12.0.jar, hudi-hadoop-mr-0.12.0.jar); the issue persists. > > > ENV: impala4.0+hive3.1.1 with hudi 0.11 is correct. > > @zhengyuan-cn do you mean you replaced `hudi-*-0.5.0` with `hudi-*-0.11.0` and it worked? > Hi @xushiyan, I debugged in Flink + Hudi local mode and found that CleanPlanner deleted one of my partitions. I have three partitions (2022/09/27, 2022/09/28, 2022/09/29), and CleanPlanner deleted partition '2022/09/27'. Detailed logs below. 307113 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - 1 patterns used to delete in partition path:2022/09/27 307113 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Partition 2022/09/27 to be deleted --- detail log : `306975 [pool-16-thread-1] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - Took 62 ms to read 16 instants, 171 replaced file groups 306998 [pool-16-thread-1] INFO org.apache.hudi.common.util.ClusteringUtils - Found 109 files in pending clustering operations 306998 [pool-16-thread-1] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://192.168.1.75:58989/v1/hoodie/view/compactions/pending/?basepath=hdfs%3A%2F%2Fhadoop01%3A9000%2Fhudi%2Fcow-intact-4=20220929165857714=3446cb10ee80b94e6b37ad4052890146807bbf579bd20bed86c2e7564d09b62d) 307014 [qtp805746605-86] INFO org.apache.hudi.timeline.service.RequestHandler - Syncing view as client passed last known instant 20220929165857714 as last known instant but server has the following last instant on timeline :Option{val=[20220929165857714__commit__COMPLETED]} 307018 [qtp805746605-86] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants upto : Option{val=[==>20220929165927744__commit__INFLIGHT]} 307049 
[qtp805746605-86] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - Took 31 ms to read 16 instants, 171 replaced file groups 307072 [qtp805746605-86] INFO org.apache.hudi.common.util.ClusteringUtils - Found 109 files in pending clustering operations 307078 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Incremental Cleaning mode is enabled. Looking up partition-paths that have since changed since last cleaned at 20220929164457499. New Instant to retain : Option{val=[20220929164559700__replacecommit__COMPLETED]} 307086 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Total Partitions to clean : 3, with policy KEEP_LATEST_COMMITS 307086 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Using cleanerParallelism: 3 307086 [pool-16-thread-1] INFO org.apache.hudi.table.action.clean.CleanPlanner - Cleaning 2022/09/27, retaining latest 30 commits. 307086 [ForkJoinPool.commonPool-worker-6] INFO org.apache.hudi.table.action.clean.CleanPlanner - Cleaning 2022/09/28, retaining latest 30 commits. 307086 [ForkJoinPool.commonPool-worker-11] INFO org.apache.hudi.table.action.clean.CleanPlanner - Cleaning 2022/09/29, retaining latest 30 commits. 
307087 [ForkJoinPool.commonPool-worker-6] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://192.168.1.75:58989/v1/hoodie/view/filegroups/replaced/before/?partition=2022%2F09%2F28=20220929164559700=hdfs%3A%2F%2Fhadoop01%3A9000%2Fhudi%2Fcow-intact-4=20220929165857714=3446cb10ee80b94e6b37ad4052890146807bbf579bd20bed86c2e7564d09b62d) 307087 [ForkJoinPool.commonPool-worker-11] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://192.168.1.75:58989/v1/hoodie/view/filegroups/replaced/before/?partition=2022%2F09%2F29=20220929164559700=hdfs%3A%2F%2Fhadoop01%3A9000%2Fhudi%2Fcow-intact-4=20220929165857714=3446cb10ee80b94e6b37ad4052890146807bbf579bd20bed86c2e7564d09b62d) 307087 [pool-16-thread-1] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://192.168.1.75:58989/v1/hoodie/view/filegroups/replaced/before/?partition=2022%2F09%2F27=20220929164559700=hdfs%3A%2F%2Fhadoop01%3A9000%2Fhudi%2Fcow-intact-4=20220929165857714=3446cb10ee80b94e6b37ad4052890146807bbf579bd20bed86c2e7564d09b62d) 307089 [qtp805746605-535] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - Building file system view for partition (2022/09/27) 307090 [qtp805746605-535] INFO org.apache.hudi.common.table.view.AbstractTableFileSystemView - addFilesToView: NumFiles=3, NumFileGroups=2, FileGroupsCreationTime=0, StoreTimeTaken=0 307093 [qtp805746605-81] INFO
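For context on the log above: the `KEEP_LATEST_COMMITS` policy retains files referenced by the latest N commits and marks older file versions for cleaning. The following sketch shows only the retention boundary calculation; it is an illustration, not Hudi's actual `CleanPlanner`, which additionally accounts for savepoints, pending clustering, and incremental cleaning.

```java
import java.util.List;

// Minimal sketch of the KEEP_LATEST_COMMITS retention boundary
// (illustrative only, not the real org.apache.hudi CleanPlanner logic).
class KeepLatestCommitsSketch {

  // Given commit timestamps sorted ascending, return the earliest commit
  // that must be retained when keeping the latest `retained` commits.
  // File versions only visible to commits older than this boundary are
  // candidates for cleaning; returns null when there are no commits.
  static String earliestCommitToRetain(List<String> sortedCommits, int retained) {
    if (sortedCommits.isEmpty()) {
      return null;
    }
    if (sortedCommits.size() <= retained) {
      // Fewer commits than the retention count: keep everything.
      return sortedCommits.get(0);
    }
    return sortedCommits.get(sortedCommits.size() - retained);
  }
}
```

With "retaining latest 30 commits" as in the log, a partition's files are only cleanable once more than 30 commits have accumulated past the file's last update.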
[GitHub] [hudi] zhengyuan-cn opened a new issue, #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn opened a new issue, #6596: URL: https://github.com/apache/hudi/issues/6596 ENV: impala4.0+hive3.1.1 with hudi 0.12. Executing `select count(*) from tableName;` via the impala shell returns 195264946 rows, fewer than the actual 217884008 rows. Spark SQL returns the correct result of 217884008 rows. I refreshed the table multiple times and still got the incorrect result. I replaced the Impala Hudi dependency jars (hudi-common-0.5.0-incubating.jar, hudi-hadoop-mr-0.5.0-incubating.jar) with (hudi-common-0.12.0.jar, hudi-hadoop-mr-0.12.0.jar); the issue persists. ENV: impala4.0+hive3.1.1 with hudi 0.11 is correct. **Environment Description** * Hudi version : 0.12 * Spark version : spark-2.4.8 * Hive version : 3.1.1 (the one bundled with Impala) * Hadoop version : hadoop-3.2.2 * Storage (HDFS/S3/GCS..) : HDFS * Running on Docker? (yes/no) : no Additional : Impala: `[192.168.1.52:21000] hudi> refresh model_series_data_3; Connection lost, reconnecting... Opened TCP connection to 192.168.1.52:21000 Query: use `hudi` Query: refresh model_series_data_3 Query submitted at: 2022-09-05 07:07:44 (Coordinator: http://192.168.10.52:25000) Query progress can be monitored at: http://192.168.1.52:25000/query_plan?query_id=b34a6e2e71c0af91:2521ad2d Fetched 0 row(s) in 0.28s [192.168.1.52:21000] hudi> select count(*) from model_series_data_3; Query: select count(*) from model_series_data_3 Query submitted at: 2022-09-05 07:07:46 (Coordinator: http://192.168.10.52:25000) Query progress can be monitored at: http://192.168.1.52:25000/query_plan?query_id=f848080d361104ad:ebb3af9a +---+ | count(*) | +---+ | 195264946 | +---+ Fetched 1 row(s) in 2.72s` == Spark : `+-+ | count(1)| +-+ |217884008| +-+ 16:30:59,796 INFO AbstractConnector:381 - Stopped Spark@47da3952{HTTP/1.1, (http/1.1)}{0.0.0.0:4040} 16:30:59,797 INFO SparkUI:54 - Stopped Spark web UI at http://192.168.2.56:4040`
[GitHub] [hudi] zhengyuan-cn commented on issue #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn commented on issue #6596: URL: https://github.com/apache/hudi/issues/6596#issuecomment-1262984003 > @zhengyuan-cn do you mean you replaced `hudi-*-0.5.0` with `hudi-*-0.11.0` and it worked? No — the environment (impala4.0+hive3.1.1 with hudi 0.11) worked, and the result is correct.
[GitHub] [hudi] zhengyuan-cn closed issue #6596: [SUPPORT] with Impala 4.0 Records lost
zhengyuan-cn closed issue #6596: [SUPPORT] with Impala 4.0 Records lost URL: https://github.com/apache/hudi/issues/6596
[GitHub] [hudi] hudi-bot commented on pull request #6817: [HUDI-4942] Fix RowSource schema provider
hudi-bot commented on PR #6817: URL: https://github.com/apache/hudi/pull/6817#issuecomment-1262945708 ## CI report: * e1589ebfa7aea943040a85de3b93a4613b365d83 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11908)
[GitHub] [hudi] nsivabalan merged pull request #6355: [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand
nsivabalan merged PR #6355: URL: https://github.com/apache/hudi/pull/6355
[GitHub] [hudi] hudi-bot commented on pull request #6665: [HUDI-4850] Incremental Ingestion from GCS
hudi-bot commented on PR #6665: URL: https://github.com/apache/hudi/pull/6665#issuecomment-1262893169 ## CI report: * 4864b65515d6e9ea5b6ba9d83241cfc310cbf3ee UNKNOWN * 5ed92a20666863315f41578a905dd6f2681a1363 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11907)
[GitHub] [hudi] hudi-bot commented on pull request #6575: [HUDI-4754] Add compliance check in github actions
hudi-bot commented on PR #6575: URL: https://github.com/apache/hudi/pull/6575#issuecomment-1262892850 ## CI report: * 1600e31836157c8d05e3bc8b9e08e1717471f1a6 UNKNOWN * 4d02f2c64a5fc4b89889677ee639a20b53cec26a UNKNOWN * 48147d19c835e7868102fd2d083659e6ee2ac343 UNKNOWN * b524fcc1dc3a5ce4d32a1238e09b9cc58b3e26b6 UNKNOWN * 3f2440a00e10b2c2daa4d930fd2933d48f5be1a2 UNKNOWN * 5dfc76a457a1ef80cc87d35a2bd24bab01edfd5b UNKNOWN * 51979ee5abe5df950a320e0b0ba02532c589432d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11906)
[hudi] branch master updated: [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (#6355)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 15ca7a3060 [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (#6355) 15ca7a3060 is described below commit 15ca7a306058c5d8c708b5310cb92f213f8d5834 Author: 冯健 AuthorDate: Fri Sep 30 06:34:00 2022 +0800 [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand (#6355) Co-authored-by: jian.feng --- .../hudi/command/MergeIntoHoodieTableCommand.scala | 6 ++-- .../spark/sql/hudi/TestMergeIntoTable2.scala | 40 +- 2 files changed, 42 insertions(+), 4 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala index 2761a00205..f0394ad379 100644 --- a/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala +++ b/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala @@ -509,7 +509,8 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie val targetTableDb = targetTableIdentify.database.getOrElse("default") val targetTableName = targetTableIdentify.identifier val path = hoodieCatalogTable.tableLocation -val catalogProperties = hoodieCatalogTable.catalogProperties +// force to use ExpressionPayload as WRITE_PAYLOAD_CLASS_NAME in MergeIntoHoodieTableCommand +val catalogProperties = hoodieCatalogTable.catalogProperties + (PAYLOAD_CLASS_NAME.key -> classOf[ExpressionPayload].getCanonicalName) val tableConfig = hoodieCatalogTable.tableConfig val tableSchema = hoodieCatalogTable.tableSchema val partitionColumns = 
tableConfig.getPartitionFieldProp.split(",").map(_.toLowerCase) @@ -523,14 +524,13 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie val hoodieProps = getHoodieProps(catalogProperties, tableConfig, sparkSession.sqlContext.conf) val hiveSyncConfig = buildHiveSyncConfig(hoodieProps, hoodieCatalogTable) -withSparkConf(sparkSession, hoodieCatalogTable.catalogProperties) { +withSparkConf(sparkSession, catalogProperties) { Map( "path" -> path, RECORDKEY_FIELD.key -> tableConfig.getRecordKeyFieldProp, PRECOMBINE_FIELD.key -> preCombineField, TBL_NAME.key -> hoodieCatalogTable.tableName, PARTITIONPATH_FIELD.key -> tableConfig.getPartitionFieldProp, -PAYLOAD_CLASS_NAME.key -> classOf[ExpressionPayload].getCanonicalName, HIVE_STYLE_PARTITIONING.key -> tableConfig.getHiveStylePartitioningEnable, URL_ENCODE_PARTITIONING.key -> tableConfig.getUrlEncodePartitioning, KEYGENERATOR_CLASS_NAME.key -> classOf[SqlKeyGenerator].getCanonicalName, diff --git a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala index 8e6acd1be5..8a6aa9691d 100644 --- a/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala +++ b/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable2.scala @@ -674,7 +674,7 @@ class TestMergeIntoTable2 extends HoodieSparkSqlTestBase { } } - test ("Test Merge into with String cast to Double") { + test("Test Merge into with String cast to Double") { withTempDir { tmp => val tableName = generateTableName // Create a cow partitioned table. 
@@ -713,4 +713,42 @@ class TestMergeIntoTable2 extends HoodieSparkSqlTestBase { ) } } + + test("Test Merge into where manually set DefaultHoodieRecordPayload") { +withTempDir { tmp => + val tableName = generateTableName + // Create a cow table with default payload class, check whether it will be overwritten by ExpressionPayload. + // if not, this ut cannot pass since DefaultHoodieRecordPayload can not promotion int to long when insert a ts with Integer value + spark.sql( +s""" + | create table $tableName ( + | id int, + | name string, + | ts long + | ) using hudi + | tblproperties ( + | type = 'cow', + | primaryKey = 'id', + | preCombineField = 'ts', + | hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload' + | ) location '${tmp.getCanonicalPath}' + """.stripMargin) + // Insert data + spark.sql(s"insert into $tableName
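The key change in the commit above is appending the payload class to `catalogProperties` (`hoodieCatalogTable.catalogProperties + (PAYLOAD_CLASS_NAME.key -> classOf[ExpressionPayload].getCanonicalName)`), so that a user-supplied `hoodie.datasource.write.payload.class` is always overwritten before the MERGE INTO write is issued. Scala's `map + (key -> value)` replaces any existing entry for the key, which is what the following Java sketch mimics; the `ExpressionPayload` class-name string and the helper method here are illustrative assumptions, not the actual code path.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the override in MergeIntoHoodieTableCommand: a later put() for the
// same key wins, so the user's payload class is replaced by ExpressionPayload.
class PayloadOverrideSketch {

  static final String PAYLOAD_CLASS_NAME = "hoodie.datasource.write.payload.class";

  static Map<String, String> forceExpressionPayload(Map<String, String> catalogProperties) {
    Map<String, String> merged = new HashMap<>(catalogProperties);
    // Assumed canonical name of the Spark-SQL expression payload class.
    merged.put(PAYLOAD_CLASS_NAME,
        "org.apache.spark.sql.hudi.command.payload.ExpressionPayload");
    return merged;
  }
}
```

This is why the new test in the diff can create a table with `DefaultHoodieRecordPayload` in its tblproperties and still expect the MERGE INTO to succeed: the command-level override takes precedence over the table property.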
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6680: [HUDI-4812] lazy fetching partition path & file slice for HoodieFileIndex
alexeykudinkin commented on code in PR #6680: URL: https://github.com/apache/hudi/pull/6680#discussion_r984047423 ## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ## @@ -179,15 +197,125 @@ public void close() throws Exception { } protected List getAllQueryPartitionPaths() { +if (cachedAllPartitionPaths != null) { + return cachedAllPartitionPaths; +} + +loadAllQueryPartitionPaths(); +return cachedAllPartitionPaths; + } + + private void loadAllQueryPartitionPaths() { List queryRelativePartitionPaths = queryPaths.stream() .map(path -> FSUtils.getRelativePartitionPath(basePath, path)) .collect(Collectors.toList()); -// Load all the partition path from the basePath, and filter by the query partition path. -// TODO load files from the queryRelativePartitionPaths directly. -List matchedPartitionPaths = getAllPartitionPathsUnchecked() -.stream() -.filter(path -> queryRelativePartitionPaths.stream().anyMatch(path::startsWith)) +this.cachedAllPartitionPaths = listQueryPartitionPaths(queryRelativePartitionPaths); + +// If the partition value contains InternalRow.empty, we query it as a non-partitioned table. +this.queryAsNonePartitionedTable = this.cachedAllPartitionPaths.stream().anyMatch(p -> p.values.length == 0); Review Comment: We don't need this field anymore we can use `isPartitionedTable` method ## hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: ## @@ -179,15 +197,125 @@ public void close() throws Exception { } protected List getAllQueryPartitionPaths() { +if (cachedAllPartitionPaths != null) { + return cachedAllPartitionPaths; +} + +loadAllQueryPartitionPaths(); +return cachedAllPartitionPaths; + } + + private void loadAllQueryPartitionPaths() { List queryRelativePartitionPaths = queryPaths.stream() .map(path -> FSUtils.getRelativePartitionPath(basePath, path)) .collect(Collectors.toList()); -// Load all the partition path from the basePath, and filter by the query partition path. 
-// TODO load files from the queryRelativePartitionPaths directly. -List matchedPartitionPaths = getAllPartitionPathsUnchecked() -.stream() -.filter(path -> queryRelativePartitionPaths.stream().anyMatch(path::startsWith)) +this.cachedAllPartitionPaths = listQueryPartitionPaths(queryRelativePartitionPaths); + +// If the partition value contains InternalRow.empty, we query it as a non-partitioned table. +this.queryAsNonePartitionedTable = this.cachedAllPartitionPaths.stream().anyMatch(p -> p.values.length == 0); + } + + protected Map> getAllInputFileSlices() { +if (!isAllInputFileSlicesCached) { + doRefresh(); +} +return cachedAllInputFileSlices; + } + + /** + * Get input file slice for the given partition. Will use cache directly if it is computed before. + */ + protected List getCachedInputFileSlices(PartitionPath partition) { +return cachedAllInputFileSlices.computeIfAbsent(partition, this::loadFileSlicesForPartition); + } + + private List loadFileSlicesForPartition(PartitionPath p) { +FileStatus[] files = loadPartitionPathFiles(p); +HoodieTimeline activeTimeline = getActiveTimeline(); +Option latestInstant = activeTimeline.lastInstant(); + +HoodieTableFileSystemView fileSystemView = new HoodieTableFileSystemView(metaClient, activeTimeline, files); + +Option queryInstant = specifiedQueryInstant.or(() -> latestInstant.map(HoodieInstant::getTimestamp)); + +validate(activeTimeline, queryInstant); + +List ret; +if (tableType.equals(HoodieTableType.MERGE_ON_READ) && queryType.equals(HoodieTableQueryType.SNAPSHOT)) { + ret = queryInstant.map(instant -> + fileSystemView.getLatestMergedFileSlicesBeforeOrOn(p.path, queryInstant.get()) + .collect(Collectors.toList()) + ) + .orElse(Collections.emptyList()); +} else { + ret = queryInstant.map(instant -> + fileSystemView.getLatestFileSlicesBeforeOrOn(p.path, instant, true) + ) + .orElse(fileSystemView.getLatestFileSlices(p.path)) + .collect(Collectors.toList()); +} + +cachedFileSize += 
ret.stream().mapToLong(BaseHoodieTableFileIndex::fileSliceSize).sum(); +return ret; + } + + /** + * Get partition path with the given partition value + * @param partitionNames partition names + * @param values partition values + * @return partitions that match the given partition values + */ + protected List getPartitionPaths(String[] partitionNames, String[] values) { +if (partitionNames.length == 0 || partitionNames.length != values.length) { Review Comment: Let's actually extract composing of the relative partition path (from values) into a standalone method. Then we can get eliminate this one and then just do:
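The `getCachedInputFileSlices` method quoted above leans on `Map.computeIfAbsent` so each partition's file slices are loaded at most once, on first access. A standalone illustration of that lazy per-key caching pattern follows; the string-valued loader is a stand-in for `loadFileSlicesForPartition`, not the real listing logic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Lazy per-key cache in the style of getCachedInputFileSlices:
// the loader runs only on the first get() for a given partition.
class LazyPartitionCache {

  private final Map<String, String> cache = new ConcurrentHashMap<>();
  final AtomicInteger loads = new AtomicInteger(0); // counts expensive loads

  // Stand-in for loadFileSlicesForPartition: an expensive listing, done once per key.
  private String load(String partition) {
    loads.incrementAndGet();
    return "slices-of-" + partition;
  }

  String get(String partition) {
    return cache.computeIfAbsent(partition, this::load);
  }
}
```

With `ConcurrentHashMap`, `computeIfAbsent` also guarantees the loader runs at most once per key even under concurrent access, which is what makes this pattern safe for a shared file index.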
[hudi] branch asf-site updated: [DOCS] Add new blogs (#6833)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 001100a4ed [DOCS] Add new blogs (#6833) 001100a4ed is described below commit 001100a4ed468aa7f384b426b0ba979a00734227 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Thu Sep 29 15:15:49 2022 -0700 [DOCS] Add new blogs (#6833) --- README.md | 1 + ...plementation-of-SCD-2-with-Apache-Hudi-and-Spark.mdx | 17 + ...-Data-Lake-Table-Formats-Delta-Lake-Iceberg-Hudi.mdx | 17 + 3 files changed, 35 insertions(+) diff --git a/README.md b/README.md index f688f75f15..938621247a 100644 --- a/README.md +++ b/README.md @@ -186,6 +186,7 @@ Take a look at this blog for reference - (Apache Hudi vs Delta Lake vs Apache Ic - use-case (some community users talking about their use-case) - design (technical articles talking about Hudi internal design/impl) - performance (involves performance related blogs) + - blog (anything else such as announcements/release updates/insights/guides/tutorials/concepts overview etc) 2. tag 2 - Represent individual features - clustering, compaction, ingestion, meta-sync etc. 3. 
tag 3

diff --git a/website/blog/2022-08-24-Implementation-of-SCD-2-with-Apache-Hudi-and-Spark.mdx b/website/blog/2022-08-24-Implementation-of-SCD-2-with-Apache-Hudi-and-Spark.mdx
new file mode 100644
index 00..e876ab202e
--- /dev/null
+++ b/website/blog/2022-08-24-Implementation-of-SCD-2-with-Apache-Hudi-and-Spark.mdx
@@ -0,0 +1,17 @@
+---
+title: "Implementation of SCD-2 (Slowly Changing Dimension) with Apache Hudi & Spark"
+authors:
+- name: Jayasheel Kalgal
+- name: Esha Dhing
+- name: Prashant Mishra
+category: blog
+image: /assets/images/blog/2022-08-24_implementation_of_scd_2_with_hudi_and_spark.jpeg
+tags:
+- use-case
+- scd2
+- walmartglobaltech
+---
+
+import Redirect from '@site/src/components/Redirect';
+
+<Redirect url="https://medium.com/walmartglobaltech/implementation-of-scd-2-slowly-changing-dimension-with-apache-hudi-465e0eb94a5">Redirecting... please wait!!</Redirect>

diff --git a/website/blog/2022-09-20-Data-Lake-Lakehouse-Guide-Powered-by-Data-Lake-Table-Formats-Delta-Lake-Iceberg-Hudi.mdx b/website/blog/2022-09-20-Data-Lake-Lakehouse-Guide-Powered-by-Data-Lake-Table-Formats-Delta-Lake-Iceberg-Hudi.mdx
new file mode 100644
index 00..4a67b2337d
--- /dev/null
+++ b/website/blog/2022-09-20-Data-Lake-Lakehouse-Guide-Powered-by-Data-Lake-Table-Formats-Delta-Lake-Iceberg-Hudi.mdx
@@ -0,0 +1,17 @@
+---
+title: "Building Streaming Data Lakes with Hudi and MinIO"
+authors:
+- name: Matt Sarrel
+category: blog
+image: /assets/images/blog/2022-09-20_streaming_data_lakes_with_hudi_and_minio.png
+tags:
+- how-to
+- datalake
+- datalake-platform
+- streaming ingestion
+- minio
+---
+
+import Redirect from '@site/src/components/Redirect';
+
+<Redirect url="https://blog.min.io/streaming-data-lakes-hudi-minio/">Redirecting... please wait!!</Redirect>
[GitHub] [hudi] yihua merged pull request #6833: [DOCS] Add new blogs
yihua merged PR #6833: URL: https://github.com/apache/hudi/pull/6833 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua opened a new pull request, #6834: [DOCS] Add 1.0.0 release entry to Roadmap
yihua opened a new pull request, #6834: URL: https://github.com/apache/hudi/pull/6834 ### Change Logs As above. ### Impact **Risk level: none** The website can be built and visualized. ### Documentation Update N/A. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] bhasudha commented on pull request #6833: [DOCS] Add new blogs
bhasudha commented on PR #6833: URL: https://github.com/apache/hudi/pull/6833#issuecomment-1262862834 Screenshot attached from local testing: https://user-images.githubusercontent.com/2179254/193150303-4a14718d-12aa-42d9-9d7d-13a4a011b385.png
[GitHub] [hudi] bhasudha opened a new pull request, #6833: [DOCS] Add new blogs
bhasudha opened a new pull request, #6833: URL: https://github.com/apache/hudi/pull/6833 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Documentation Update _Describe any necessary documentation update if there is any new feature, config, or user-facing change_ - _The config description must be updated if new configs are added or the default value of the configs are changed_ - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make changes to the website._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] yihua commented on a diff in pull request #5113: [HUDI-3625] [RFC-60] Optimized storage layout for Cloud Object Stores
yihua commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r984045645

## rfc/rfc-56/rfc-56.md:
## @@ -0,0 +1,226 @@
+
+# RFC-56: Federated Storage Layer
+
+## Proposers
+- @umehrot2
+
+## Approvers
+- @vinoth
+- @shivnarayan
+
+## Status
+
+JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625)
+
+## Abstract
+
+As you scale your Apache Hudi workloads over cloud object stores like Amazon S3, there is the potential of hitting request
+throttling limits, which in turn impacts performance. In this RFC, we are proposing to support an alternate storage
+layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and
+significantly reduce throttling.
+
+In addition, we are proposing an interface that would allow users to implement their own custom strategy to
+distribute the data files across cloud stores, HDFS, or on-prem storage based on their specific use cases.
+
+## Background
+
+Apache Hudi follows the traditional Hive storage layout while writing files on storage:
+- Partitioned tables: the files are distributed across multiple physical partition folders, under the table's base path.
+- Non-partitioned tables: the files are stored directly under the table's base path.
+
+While this storage layout scales well for HDFS, it increases the probability of hitting request throttle limits when
+working with cloud object stores like Amazon S3 and others. This is because Amazon S3 and other cloud stores [throttle
+requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+Amazon S3 does scale based on request patterns for different prefixes and adds internal partitions (with their own request limits),
+but there can be a 30-60 minute wait time before new partitions are created. Thus, all files/objects stored under the
+same table path prefix could result in these request limits being hit for the table prefix, especially as workloads
+scale and there are several thousands of files being written/updated concurrently. This hurts performance, since
+retrying failed requests reduces throughput, and results in occasional failures if the retries cannot succeed
+either and continue to be throttled.
+
+The traditional storage layout also tightly couples the partitions as folders under the table path. However,
+some users want the flexibility to distribute files/partitions under multiple different paths across cloud stores,
+HDFS, etc. based on their specific needs. For example, customers have use cases that distribute files for each partition under
+a separate S3 bucket with its individual encryption key. It is not possible to implement such use cases with Hudi currently.
+
+The high-level proposal here is to introduce a new storage layout strategy, where all files are distributed evenly across
+multiple randomly generated prefixes under the Amazon S3 bucket, instead of being stored under a common table path/prefix.
+This would help distribute the requests evenly across different prefixes, resulting in Amazon S3 creating partitions for
+the prefixes, each with its own request limit. This significantly reduces the possibility of hitting the request limit
+for a specific prefix/partition.
+
+In addition, we want to expose an interface that provides users the flexibility to implement their own strategy for
+distributing files if using the traditional Hive storage layout or the federated storage layer (proposed in this RFC) does
+not meet their use case.
+
+## Design
+
+### Interface
+
+```java
+/**
+ * Interface for providing storage file locations.
+ */
+public interface FederatedStorageStrategy extends Serializable {
+  /**
+   * Return a fully-qualified storage file location for the given filename.
+   *
+   * @param fileName data file name
+   * @return a fully-qualified location URI for a data file
+   */
+  String storageLocation(String fileName);
+
+  /**
+   * Return a fully-qualified storage file location for the given partition and filename.
+   *
+   * @param partitionPath partition path for the file
+   * @param fileName data file name
+   * @return a fully-qualified location URI for a data file
+   */
+  String storageLocation(String partitionPath, String fileName);
+}
+```
+
+### Generating file paths for Cloud storage optimized layout
+
+We want to distribute files evenly across multiple random prefixes, instead of following the traditional Hive storage
+layout of keeping them under a common table path/prefix. In addition to the `Table Path`, for this new layout the user will
+configure another `Table Storage Path` under which the actual data files will be distributed. The original `Table Path` will
+be used to maintain the table/partitions Hudi metadata.
+
+For the purpose of this documentation let's assume:
+```
+Table Path => s3:
+
+Table
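To make the quoted `FederatedStorageStrategy` interface concrete, here is a hedged sketch of one possible implementation: files are spread under a configured storage path by hashing the file name into one of N fixed prefixes. Note the RFC proposes randomly generated prefixes recorded in metadata; a deterministic hash is used here only so the sketch is self-contained. `HashedPrefixStorageStrategy` and its prefix scheme are illustrative, not part of the RFC.

```java
import java.io.Serializable;

// The interface proposed in the RFC (reproduced for a self-contained sketch).
interface FederatedStorageStrategy extends Serializable {
  String storageLocation(String fileName);
  String storageLocation(String partitionPath, String fileName);
}

// Illustrative strategy: distribute files across a fixed set of hex prefixes
// under a storage base path, so requests spread over multiple object-store
// prefixes instead of a single common table prefix.
class HashedPrefixStorageStrategy implements FederatedStorageStrategy {
  private final String storageBasePath;  // e.g. "s3://bucket/table-storage" (assumed layout)
  private final int numPrefixes;

  HashedPrefixStorageStrategy(String storageBasePath, int numPrefixes) {
    this.storageBasePath = storageBasePath;
    this.numPrefixes = numPrefixes;
  }

  @Override
  public String storageLocation(String fileName) {
    return String.format("%s/%04x/%s", storageBasePath, prefixFor(fileName), fileName);
  }

  @Override
  public String storageLocation(String partitionPath, String fileName) {
    return String.format("%s/%04x/%s/%s", storageBasePath, prefixFor(fileName), partitionPath, fileName);
  }

  // Stable hash of the file name, folded into [0, numPrefixes).
  private int prefixFor(String fileName) {
    return Math.floorMod(fileName.hashCode(), numPrefixes);
  }
}
```

Because the same file name always hashes to the same prefix, a reader can recompute a file's location without extra metadata; truly random prefixes, as the RFC proposes, instead require the prefix mapping to be tracked in the table's metadata.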
[GitHub] [hudi] yihua commented on a diff in pull request #5113: [HUDI-3625] [RFC-60] Optimized storage layout for Cloud Object Stores
yihua commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r984024879 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,226 @@ + String storageLocation(String partitionPath, String fileName); Review Comment: What does the `fileName` refer to here? Is it the logical file name of a base or log file in a Hudi file slice? And is this relative or absolute?
[GitHub] [hudi] hudi-bot commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
hudi-bot commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1262834872 ## CI report: * 288d166c49602a4593b1e97763a467811903737d UNKNOWN * ae59f6f918a5a08535b73be5c3fc2f29f5e84fb9 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11879) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11913) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6355: [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand
hudi-bot commented on PR #6355: URL: https://github.com/apache/hudi/pull/6355#issuecomment-1262834668 ## CI report: * 51fe330035a595e4d65cdf58554077ed0916fd25 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11905)
[GitHub] [hudi] hudi-bot commented on pull request #6815: [HUDI-4937] Fix `HoodieTable` injecting non-reusable `HoodieBackedTableMetadata` aggressively flushing MT readers
hudi-bot commented on PR #6815: URL: https://github.com/apache/hudi/pull/6815#issuecomment-1262827821 ## CI report: * 12160b8c178ef5bd2721727207c41fdfa2f40e8f Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11883) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11912)
[hudi] branch asf-site updated: [DOCS] Add images for new blogs
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 8260a6882c [DOCS] Add images for new blogs 8260a6882c is described below commit 8260a6882ca80d9995bba1880f5668576f966043 Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Thu Sep 29 14:04:38 2022 -0700 [DOCS] Add images for new blogs --- ...24_implementation_of_scd_2_with_hudi_and_spark.jpeg | Bin 0 -> 183751 bytes ...-09-20_streaming_data_lakes_with_hudi_and_minio.png | Bin 0 -> 213834 bytes 2 files changed, 0 insertions(+), 0 deletions(-) diff --git a/website/static/assets/images/blog/2022-08-24_implementation_of_scd_2_with_hudi_and_spark.jpeg b/website/static/assets/images/blog/2022-08-24_implementation_of_scd_2_with_hudi_and_spark.jpeg new file mode 100644 index 00..deb165ec78 Binary files /dev/null and b/website/static/assets/images/blog/2022-08-24_implementation_of_scd_2_with_hudi_and_spark.jpeg differ diff --git a/website/static/assets/images/blog/2022-09-20_streaming_data_lakes_with_hudi_and_minio.png b/website/static/assets/images/blog/2022-09-20_streaming_data_lakes_with_hudi_and_minio.png new file mode 100644 index 00..364979dc31 Binary files /dev/null and b/website/static/assets/images/blog/2022-09-20_streaming_data_lakes_with_hudi_and_minio.png differ
[GitHub] [hudi] alexeykudinkin commented on issue #6758: [SUPPORT] Will metatable support partitions inside col_stat & files?
alexeykudinkin commented on issue #6758: URL: https://github.com/apache/hudi/issues/6758#issuecomment-1262814782 @Zhangshunyu we're able to do this filtering even w/o physical partitioning (thanks to relying on HFile and an elaborate key encoding scheme) -- we only read the records corresponding to files (in case of Column Stats) pertaining to a particular partition.
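The comment above can be illustrated with a hedged analogy (this is not Hudi's actual metadata-table key scheme): when index records use keys that lead with an encoded partition, a sorted key-value layout such as HFile's lets a reader fetch one partition's entries with a prefix range scan instead of scanning everything. A `TreeMap` stands in for the sorted file below; the `\u0000` separator and key layout are assumptions made for the sketch.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: a sorted map stands in for HFile's sorted key layout. Keys lead
// with the partition, so one partition's column-stats entries are contiguous
// and can be read via a range scan bounded by the partition prefix.
class ColumnStatsIndexSketch {
  private final NavigableMap<String, String> sorted = new TreeMap<>();

  void put(String partition, String file, String column, String stats) {
    // Hypothetical key encoding: partition, then file, then column name.
    sorted.put(partition + "\u0000" + file + "\u0000" + column, stats);
  }

  // Range scan touching exactly the keys that share the partition prefix.
  NavigableMap<String, String> statsForPartition(String partition) {
    String lo = partition + "\u0000";
    String hi = partition + "\u0001";  // just above every key in this partition
    return sorted.subMap(lo, true, hi, false);
  }
}
```

The point of the analogy is that pruning by partition does not require physical partitioning of the index itself, only that keys sort so a partition's records are adjacent.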
[GitHub] [hudi] alexeykudinkin commented on pull request #6815: [HUDI-4937] Fix `HoodieTable` injecting non-reusable `HoodieBackedTableMetadata` aggressively flushing MT readers
alexeykudinkin commented on PR #6815: URL: https://github.com/apache/hudi/pull/6815#issuecomment-1262808073 @hudi-bot run azure
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6805: [HUDI-4949] optimize cdc read to avoid the problem of reusing buffer underlying the Row
alexeykudinkin commented on code in PR #6805: URL: https://github.com/apache/hudi/pull/6805#discussion_r984015580 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/HoodieCDCRDD.scala: ## @@ -516,7 +515,7 @@ class HoodieCDCRDD( val iter = loadFileSlice(fileSlice) iter.foreach { row => val key = getRecordKey(row) - beforeImageRecords.put(key, serialize(row)) + beforeImageRecords.put(key, serialize(row, copy = true)) Review Comment: Let's add a comment explaining why we're copying here (to avoid confusion)
[GitHub] [hudi] alexeykudinkin commented on pull request #6358: [HUDI-4588][HUDI-4472] Fixing `HoodieParquetReader` to properly specify projected schema when reading Parquet file
alexeykudinkin commented on PR #6358: URL: https://github.com/apache/hudi/pull/6358#issuecomment-1262807900 @hudi-bot run azure
[GitHub] [hudi] nochimow commented on issue #6811: [SUPPORT] Slow upsert performance
nochimow commented on issue #6811: URL: https://github.com/apache/hudi/issues/6811#issuecomment-1262804768 Hi @nsivabalan, ~97% of the data should be inserts and the remaining are updates. The updates only touch the latest partitions (-1 day at max). No, we are not setting any small-file config in this case. Based on that, is there any tweak suggestion to decrease the index tagging stage?
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3204: -- Description: Currently, b/c Spark by default omits partition values from the data files (instead encoding them into partition paths for partitioned tables), using `TimestampBasedKeyGenerator` w/ an original timestamp-based column makes it impossible to retrieve the original value (reading from Spark) even though it's persisted in the data file as well.
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import org.apache.hudi.hive.MultiPartKeysValueExtractor

val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")

// mor
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_mor")

|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name   | id|name|age| ts| data_date|
|  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
|  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|

// can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
// still can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show

// cow
df.write.format("hudi").
  option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "data_date").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
  option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
  mode(org.apache.spark.sql.SaveMode.Append).
  save("file:///tmp/hudi/issue_4417_cow")

|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name   | id|name|age| ts| data_date|
|  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
|  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|

// can not query any data
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018-09-24'").show
// but 2018/09/24 works
spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_cow").where("data_date = '2018/09/24'").show
{code}
was:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
import
[GitHub] [hudi] alexeykudinkin commented on pull request #6355: [HUDI-4925] Should Force to use ExpressionPayload in MergeIntoTableCommand
alexeykudinkin commented on PR #6355: URL: https://github.com/apache/hudi/pull/6355#issuecomment-1262804005 CI is green: https://user-images.githubusercontent.com/428277/193139753-763ed18d-ee41-4e29-9eab-850c05f99912.png https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11905=results
[GitHub] [hudi] alexeykudinkin commented on issue #6798: [SUPPORT] - can't retrieve the partition field in stored parquet file
alexeykudinkin commented on issue #6798: URL: https://github.com/apache/hudi/issues/6798#issuecomment-1262803683 @sstimmel this is a known issue due to how Spark treats partition columns (by default, Spark doesn't persist them in the data files, but instead encodes them into the partition path). Since we're relying on some of the Spark infra to read the data, to make sure that Hudi's tables are compatible w/ Spark execution engines' optimizations, we're unfortunately strangled by these limitations currently, but we're actively looking for solutions there. You can find more details in HUDI-3204
[jira] [Updated] (HUDI-3204) Allow original partition column value to be retrieved when using TimestampBasedKeyGen
[ https://issues.apache.org/jira/browse/HUDI-3204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3204:
----------------------------------
    Summary: Allow original partition column value to be retrieved when using TimestampBasedKeyGen  (was: spark on TimestampBasedKeyGenerator has no result when query by partition column)

> Allow original partition column value to be retrieved when using
> TimestampBasedKeyGen
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-3204
>                 URL: https://issues.apache.org/jira/browse/HUDI-3204
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark
>            Reporter: Yann Byron
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: hudi-on-call, pull-request-available, sev:critical
>             Fix For: 0.12.1
>
>   Original Estimate: 3h
>          Time Spent: 1h
>  Remaining Estimate: 1h
>
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.keygen.constant.KeyGeneratorOptions._
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
>
> val df = Seq((1, "z3", 30, "v1", "2018-09-23"), (2, "z3", 35, "v1", "2018-09-24")).toDF("id", "name", "age", "ts", "data_date")
>
> // mor
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_mor").
>   option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_mor")
>
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172709324|20220110172709324...|                 2|            2018/09/24|703e56d3-badb-40b...|  2|  z3| 35| v1|2018-09-24|
> |  20220110172709324|20220110172709324...|                 1|            2018/09/23|58fde2b3-db0e-464...|  1|  z3| 30| v1|2018-09-23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
>
> // can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018-09-24'")
>
> // still can not query any data
> spark.read.format("hudi").load("file:///tmp/hudi/issue_4417_mor").where("data_date = '2018/09/24'").show
>
> // cow
> df.write.format("hudi").
>   option(HoodieWriteConfig.TABLE_NAME, "issue_4417_cow").
>   option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
>   option("hoodie.datasource.write.recordkey.field", "id").
>   option("hoodie.datasource.write.partitionpath.field", "data_date").
>   option("hoodie.datasource.write.precombine.field", "ts").
>   option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
>   option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "DATE_STRING").
>   option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
>   option("hoodie.deltastreamer.keygen.timebased.timezone", "GMT+8:00").
>   option("hoodie.deltastreamer.keygen.timebased.input.dateformat", "yyyy-MM-dd").
>   mode(org.apache.spark.sql.SaveMode.Append).
>   save("file:///tmp/hudi/issue_4417_cow")
>
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|age| ts| data_date|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> |  20220110172721896|20220110172721896...|                 2|            2018/09/24|81cc7819-a0d1-4e6...|  2|  z3| 35| v1|2018/09/24|
> |  20220110172721896|20220110172721896...|                 1|            2018/09/23|d428019b-a829-41a...|  1|  z3| 30| v1|2018/09/23|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+---+---+----------+
> {code}
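The root of the repro above is a plain date-format rewrite: TimestampBasedKeyGenerator parses `data_date` with the configured input format and re-renders it with the output format, so the stored partition path (and, in the COW output above, the surfaced `data_date` value) no longer equals the value the writer supplied, and the predicate on the original value matches nothing. A minimal `java.time` sketch of that mismatch, independent of Hudi (the class name `PartitionFormatMismatch` is ours, for illustration only):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionFormatMismatch {
    public static void main(String[] args) {
        // Formats mirroring the key generator configs in the repro:
        // input.dateformat = "yyyy-MM-dd", output.dateformat = "yyyy/MM/dd".
        DateTimeFormatter input  = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        DateTimeFormatter output = DateTimeFormatter.ofPattern("yyyy/MM/dd");

        String original = "2018-09-24";

        // Parse with the input format, re-render with the output format,
        // which is what the key generator does to build the partition path.
        String partitionValue = LocalDate.parse(original, input).format(output);

        // The rewritten value differs from the original column value, so a
        // filter on data_date = '2018-09-24' cannot match it.
        System.out.println(partitionValue);                  // 2018/09/24
        System.out.println(partitionValue.equals(original)); // false
    }
}
```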
[jira] [Updated] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"
[ https://issues.apache.org/jira/browse/HUDI-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4879:
----------------------------------
    Reviewers: Alexey Kudinkin

> MERGE INTO fails when setting "hoodie.datasource.write.payload.class"
> ---------------------------------------------------------------------
>
>                 Key: HUDI-4879
>                 URL: https://issues.apache.org/jira/browse/HUDI-4879
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Jian Feng
>            Priority: Blocker
>             Fix For: 0.12.1
>
> As reported by the user:
> https://github.com/apache/hudi/issues/6354
>
> Currently, setting {{hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'}} will result in the following exception:
> {code:java}
> org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
>     at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
>     at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
>     at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498)
>     at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
>     at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
>     at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
>     at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new record with old value in storage, for new record HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, currentLocation='HoodieRecordLocation {instantTime=20220810095846644, fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', newLocation='HoodieRecordLocation {instantTime=20220810101719437, fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}, old value {"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": "20220810095824514_0_0", "_hoodie_record_key": "id:1", "_hoodie_partition_path": "", "_hoodie_file_name": "60c04f95-ca5e-4f82-9558-40da29cc022e-0_0-937-24808_20220810095846644.parquet", "id": 1, "name": "a0", "ts": 1000}
>     at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
>     at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
>     ... 28 more
> Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to