[jira] [Assigned] (HUDI-1674) add partition level delete DOC or example

2021-08-12 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui reassigned HUDI-1674:
---

Assignee: liujinhui

> add partition level delete DOC or example
> -
>
> Key: HUDI-1674
> URL: https://issues.apache.org/jira/browse/HUDI-1674
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liujinhui
>Priority: Minor
>  Labels: docs, user-support-issues
> Attachments: image-2021-03-08-09-57-05-768.png
>
>
> !image-2021-03-08-09-57-05-768.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1674) add partition level delete DOC or example

2021-08-12 Thread liujinhui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398470#comment-17398470
 ] 

liujinhui commented on HUDI-1674:
-

[~309637554] 
[~shivnarayan] 
I am interested in completing this work
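
A minimal sketch of the kind of example such a doc could include (hedged: it
assumes Hudi's Spark SQL DELETE support, and the table name, partition column
`dt`, and partition value are hypothetical):

{code:java}
// Assumes an existing Hudi table registered in Spark SQL and partitioned by `dt`;
// all names and values here are illustrative only.
val tableName = "hudi_partition_delete_demo"

// Delete every record in one partition by filtering on the partition column.
spark.sql(s"delete from $tableName where dt = '2021-08-01'")

// Verify the partition no longer returns rows.
spark.sql(s"select count(*) from $tableName where dt = '2021-08-01'").show()
{code}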

> add partition level delete DOC or example
> -
>
> Key: HUDI-1674
> URL: https://issues.apache.org/jira/browse/HUDI-1674
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liujinhui
>Priority: Minor
>  Labels: docs, user-support-issues
> Attachments: image-2021-03-08-09-57-05-768.png
>
>
> !image-2021-03-08-09-57-05-768.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2279) Support column name matching for insert * and update set * in merge into when sourceTable's columns contains all targetTable's columns

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398460#comment-17398460
 ] 

ASF GitHub Bot commented on HUDI-2279:
--

pengzhiwei2018 commented on pull request #3415:
URL: https://github.com/apache/hudi/pull/3415#issuecomment-898216727


   > > Hi @dongkelun , Thanks for the contribution for this. Overall LGTM 
except some minor optimize. And also you can run the test case in spark3 by the 
follow command:
   > > > mvn clean install -DskipTests -Pspark3
   > > > mvn test -Punit-tests -Pspark3 -pl hudi-spark-datasource/hudi-spark
   > 
   > Hi, @pengzhiwei2018 The result is: 'Tests: succeeded 56, failed 6, canceled 
0, ignored 0, pending 0'. Two of them are ORC exceptions, three I think are due 
to time zone differences (which I don't know how to resolve), and the remaining 
one is a mismatch in the exception message. The detailed results are as follows:
   > 
   > `1、Test Different Type of Partition Column *** FAILED *** Expected 
Array([1,a1,10,2021-05-20 00:00:00], [2,a2,10,2021-05-20 00:00:00]), but got 
Array([1,a1,10.0,2021-05-20 15:00:00], [2,a2,10.0,2021-05-20 15:00:00]) 2、- 
Test MergeInto Exception *** FAILED *** Expected "... for target field: '[id]' 
in merge into upda...", but got "... for target field: '[_ts]' in merge into 
upda..." (TestHoodieSqlBase.scala:86) 3、test basic HoodieSparkSqlWriter 
functionality with datasource insert for COPY_ON_WRITE with ORC as the base 
file format with populate meta fields true *** FAILED *** 4、test basic 
HoodieSparkSqlWriter functionality with datasource insert for MERGE_ON_READ 
with ORC as the base file format with populate meta fields true *** FAILED *** 
5、Test Sql Statements *** FAILED *** java.lang.IllegalArgumentException: 
UnExpect result for: select id, name, price, cast(dt as string) from h0_p 
Expect: 1 a1 10 2021-05-07 00:00:00, Actual: 1 a1 10 2021-05-07 15:00:00 6、Test 
Create Table As Select *** FAILED *** Expected Array([1,a1,10,2021-05-06 
00:00:00]), but got Array([1,a1,10,2021-05-06 15:00:00]) 
(TestHoodieSqlBase.scala:78) `
   
   I have rebased the code to the master and test for spark3. Except the test 
for orc, others has passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support column name matching for insert * and update set *  in merge into 
> when sourceTable's columns contains all targetTable's columns
> ---
>
> Key: HUDI-2279
> URL: https://issues.apache.org/jira/browse/HUDI-2279
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Example:
> {code:java}
> val tableName = generateTableName
> // Create table
> spark.sql(
>  s"""
>  |create table $tableName (
>  | id int,
>  | name string,
>  | price double,
>  | ts long,
>  | dt string
>  |) using hudi
>  | location '${tmp.getCanonicalPath}/$tableName'
>  | options (
>  | primaryKey ='id',
>  | preCombineField = 'ts'
>  | )
>  """.stripMargin)
> spark.sql(
>   s"""
>  |merge into $tableName as t0
>  |using (
>  |  select 1 as id, '2021-05-05' as dt, 1002 as ts, 97 as price, 'a1' as 
> name union all
>  |  select 1 as id, '2021-05-05' as dt, 1003 as ts, 98 as price, 'a2' as 
> name union all
>  |  select 2 as id, '2021-05-05' as dt, 1001 as ts, 99 as price, 'a3' as 
> name
>  | ) as s0
>  |on t0.id = s0.id
>  |when matched then update set *
>  |when not matched  then insert *
>  |""".stripMargin)
> spark.sql(s"select id, name, price, ts, dt from $tableName").show(){code}
> For now, the result is:
> +---+----------+-----+---+---+
> | id|      name|price| ts| dt|
> +---+----------+-----+---+---+
> |  2|2021-05-05| 99.0| 99| a3|
> |  1|2021-05-05| 98.0| 98| a2|
> +---+----------+-----+---+---+
> When the column order of sourceTable is different from that of targetTable:
>  
> {code:java}
> spark.sql(
>   s"""
>  |merge into ${tableName} as t0
>  |using (
>  |  select 1 as id, 'a1' as name, 1002 as ts, '2021-05-05' as dt, 97 as 
> price union all
>  |  select 1 as id, 'a2' as name, 1003 as ts, '2021-05-05' as dt, 98 as 
> price union all
>  |  select 2 as id, 'a3' as name, 1001 as ts, '2021-05-05' as dt, 99 as 
> price
>  | ) as s0
>  |on t0.id = s0.id
>  |when matched 

[GitHub] [hudi] pengzhiwei2018 commented on pull request #3415: [HUDI-2279]Support column name matching for insert * and update set *

2021-08-12 Thread GitBox


pengzhiwei2018 commented on pull request #3415:
URL: https://github.com/apache/hudi/pull/3415#issuecomment-898216727


   > > Hi @dongkelun , Thanks for the contribution for this. Overall LGTM 
except some minor optimize. And also you can run the test case in spark3 by the 
follow command:
   > > > mvn clean install -DskipTests -Pspark3
   > > > mvn test -Punit-tests -Pspark3 -pl hudi-spark-datasource/hudi-spark
   > 
   > Hi, @pengzhiwei2018 The result is: 'Tests: succeeded 56, failed 6, canceled 
0, ignored 0, pending 0'. Two of them are ORC exceptions, three I think are due 
to time zone differences (which I don't know how to resolve), and the remaining 
one is a mismatch in the exception message. The detailed results are as follows:
   > 
   > `1、Test Different Type of Partition Column *** FAILED *** Expected 
Array([1,a1,10,2021-05-20 00:00:00], [2,a2,10,2021-05-20 00:00:00]), but got 
Array([1,a1,10.0,2021-05-20 15:00:00], [2,a2,10.0,2021-05-20 15:00:00]) 2、- 
Test MergeInto Exception *** FAILED *** Expected "... for target field: '[id]' 
in merge into upda...", but got "... for target field: '[_ts]' in merge into 
upda..." (TestHoodieSqlBase.scala:86) 3、test basic HoodieSparkSqlWriter 
functionality with datasource insert for COPY_ON_WRITE with ORC as the base 
file format with populate meta fields true *** FAILED *** 4、test basic 
HoodieSparkSqlWriter functionality with datasource insert for MERGE_ON_READ 
with ORC as the base file format with populate meta fields true *** FAILED *** 
5、Test Sql Statements *** FAILED *** java.lang.IllegalArgumentException: 
UnExpect result for: select id, name, price, cast(dt as string) from h0_p 
Expect: 1 a1 10 2021-05-07 00:00:00, Actual: 1 a1 10 2021-05-07 15:00:00 6、Test 
Create Table As Select *** FAILED *** Expected Array([1,a1,10,2021-05-06 
00:00:00]), but got Array([1,a1,10,2021-05-06 15:00:00]) 
(TestHoodieSqlBase.scala:78) `
   
   I have rebased the code to the master and test for spark3. Except the test 
for orc, others has passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2279) Support column name matching for insert * and update set * in merge into when sourceTable's columns contains all targetTable's columns

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398459#comment-17398459
 ] 

ASF GitHub Bot commented on HUDI-2279:
--

pengzhiwei2018 merged pull request #3415:
URL: https://github.com/apache/hudi/pull/3415


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support column name matching for insert * and update set *  in merge into 
> when sourceTable's columns contains all targetTable's columns
> ---
>
> Key: HUDI-2279
> URL: https://issues.apache.org/jira/browse/HUDI-2279
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Example:
> {code:java}
> val tableName = generateTableName
> // Create table
> spark.sql(
>  s"""
>  |create table $tableName (
>  | id int,
>  | name string,
>  | price double,
>  | ts long,
>  | dt string
>  |) using hudi
>  | location '${tmp.getCanonicalPath}/$tableName'
>  | options (
>  | primaryKey ='id',
>  | preCombineField = 'ts'
>  | )
>  """.stripMargin)
> spark.sql(
>   s"""
>  |merge into $tableName as t0
>  |using (
>  |  select 1 as id, '2021-05-05' as dt, 1002 as ts, 97 as price, 'a1' as 
> name union all
>  |  select 1 as id, '2021-05-05' as dt, 1003 as ts, 98 as price, 'a2' as 
> name union all
>  |  select 2 as id, '2021-05-05' as dt, 1001 as ts, 99 as price, 'a3' as 
> name
>  | ) as s0
>  |on t0.id = s0.id
>  |when matched then update set *
>  |when not matched  then insert *
>  |""".stripMargin)
> spark.sql(s"select id, name, price, ts, dt from $tableName").show(){code}
> For now, the result is:
> +---+----------+-----+---+---+
> | id|      name|price| ts| dt|
> +---+----------+-----+---+---+
> |  2|2021-05-05| 99.0| 99| a3|
> |  1|2021-05-05| 98.0| 98| a2|
> +---+----------+-----+---+---+
> When the column order of sourceTable is different from that of targetTable:
>  
> {code:java}
> spark.sql(
>   s"""
>  |merge into ${tableName} as t0
>  |using (
>  |  select 1 as id, 'a1' as name, 1002 as ts, '2021-05-05' as dt, 97 as 
> price union all
>  |  select 1 as id, 'a2' as name, 1003 as ts, '2021-05-05' as dt, 98 as 
> price union all
>  |  select 2 as id, 'a3' as name, 1001 as ts, '2021-05-05' as dt, 99 as 
> price
>  | ) as s0
>  |on t0.id = s0.id
>  |when matched then update set *
>  |when not matched  then insert *
>  |""".stripMargin){code}
>  
> It will throw an exception:
> {code:java}
> [ERROR] 2021-08-05 21:48:53,941 org.apache.hudi.io.HoodieWriteHandle  - Error 
> writing record HoodieRecord{key=HoodieKey { recordKey=id:2 partitionPath=}, 
> currentLocation='null', newLocation='null'}
> java.lang.RuntimeException: Error in execute expression: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer.
> Expressions is: [boundreference() AS `id`  boundreference() AS `name`  
> CAST(boundreference() AS `price` AS DOUBLE)  CAST(boundreference() AS `ts` AS 
> BIGINT)  CAST(boundreference() AS `dt` AS STRING)]
> CodeBody is: {
> ..
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to 
> java.lang.IntegerCaused by: java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer 
> at 
> org.apache.hudi.sql.payload.ExpressionPayloadEvaluator_366797ae_4c30_4862_8222_7be486ede4f8.eval(Unknown
>  Source) at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload.org$apache$spark$sql$hudi$command$payload$ExpressionPayload$$evaluate(ExpressionPayload.scala:258)
>  ... 18 more{code}
>  
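
For reference, a small standalone sketch of the name-based matching idea this 
issue proposes (illustrative only, not Hudi's actual implementation):

{code:java}
// Match source columns to target columns by name rather than by position,
// which is what "update set *" / "insert *" should do per this issue.
object NameBasedColumnMatchingSketch extends App {
  val targetColumns = Seq("id", "name", "price", "ts", "dt") // target table order
  val sourceColumns = Seq("id", "name", "ts", "dt", "price") // source projection order differs

  // For each target column, take the value from the source column with the
  // same name instead of the column sitting at the same position.
  val assignments = targetColumns.map(c => c -> sourceColumns.indexOf(c))
  assignments.foreach { case (name, idx) =>
    println(s"target column '$name' <- source column #$idx ('${sourceColumns(idx)}')")
  }
}
{code}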



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated (0544d70 -> 6602e55)

2021-08-12 Thread zhiwei
This is an automated email from the ASF dual-hosted git repository.

zhiwei pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 0544d70  [MINOR] Deprecate older configs (#3464)
 add 6602e55  [HUDI-2279]Support column name matching for insert * and 
update set * in merge into (#3415)

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/hudi/analysis/HoodieAnalysis.scala   | 24 +--
 .../spark/sql/hudi/TestMergeIntoTable2.scala   | 83 ++
 2 files changed, 102 insertions(+), 5 deletions(-)


[GitHub] [hudi] pengzhiwei2018 merged pull request #3415: [HUDI-2279]Support column name matching for insert * and update set *

2021-08-12 Thread GitBox


pengzhiwei2018 merged pull request #3415:
URL: https://github.com/apache/hudi/pull/3415


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2279) Support column name matching for insert * and update set * in merge into when sourceTable's columns contains all targetTable's columns

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398457#comment-17398457
 ] 

ASF GitHub Bot commented on HUDI-2279:
--

pengzhiwei2018 commented on pull request #3415:
URL: https://github.com/apache/hudi/pull/3415#issuecomment-898215349


   LGTM, Thanks for the contribution @dongkelun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support column name matching for insert * and update set *  in merge into 
> when sourceTable's columns contains all targetTable's columns
> ---
>
> Key: HUDI-2279
> URL: https://issues.apache.org/jira/browse/HUDI-2279
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Example:
> {code:java}
> val tableName = generateTableName
> // Create table
> spark.sql(
>  s"""
>  |create table $tableName (
>  | id int,
>  | name string,
>  | price double,
>  | ts long,
>  | dt string
>  |) using hudi
>  | location '${tmp.getCanonicalPath}/$tableName'
>  | options (
>  | primaryKey ='id',
>  | preCombineField = 'ts'
>  | )
>  """.stripMargin)
> spark.sql(
>   s"""
>  |merge into $tableName as t0
>  |using (
>  |  select 1 as id, '2021-05-05' as dt, 1002 as ts, 97 as price, 'a1' as 
> name union all
>  |  select 1 as id, '2021-05-05' as dt, 1003 as ts, 98 as price, 'a2' as 
> name union all
>  |  select 2 as id, '2021-05-05' as dt, 1001 as ts, 99 as price, 'a3' as 
> name
>  | ) as s0
>  |on t0.id = s0.id
>  |when matched then update set *
>  |when not matched  then insert *
>  |""".stripMargin)
> spark.sql(s"select id, name, price, ts, dt from $tableName").show(){code}
> For now, the result is:
> +---+----------+-----+---+---+
> | id|      name|price| ts| dt|
> +---+----------+-----+---+---+
> |  2|2021-05-05| 99.0| 99| a3|
> |  1|2021-05-05| 98.0| 98| a2|
> +---+----------+-----+---+---+
> When the column order of sourceTable is different from that of targetTable:
>  
> {code:java}
> spark.sql(
>   s"""
>  |merge into ${tableName} as t0
>  |using (
>  |  select 1 as id, 'a1' as name, 1002 as ts, '2021-05-05' as dt, 97 as 
> price union all
>  |  select 1 as id, 'a2' as name, 1003 as ts, '2021-05-05' as dt, 98 as 
> price union all
>  |  select 2 as id, 'a3' as name, 1001 as ts, '2021-05-05' as dt, 99 as 
> price
>  | ) as s0
>  |on t0.id = s0.id
>  |when matched then update set *
>  |when not matched  then insert *
>  |""".stripMargin){code}
>  
> It will throw an exception:
> {code:java}
> [ERROR] 2021-08-05 21:48:53,941 org.apache.hudi.io.HoodieWriteHandle  - Error 
> writing record HoodieRecord{key=HoodieKey { recordKey=id:2 partitionPath=}, 
> currentLocation='null', newLocation='null'}
> java.lang.RuntimeException: Error in execute expression: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer.
> Expressions is: [boundreference() AS `id`  boundreference() AS `name`  
> CAST(boundreference() AS `price` AS DOUBLE)  CAST(boundreference() AS `ts` AS 
> BIGINT)  CAST(boundreference() AS `dt` AS STRING)]
> CodeBody is: {
> ..
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to 
> java.lang.IntegerCaused by: java.lang.ClassCastException: 
> org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer 
> at 
> org.apache.hudi.sql.payload.ExpressionPayloadEvaluator_366797ae_4c30_4862_8222_7be486ede4f8.eval(Unknown
>  Source) at 
> org.apache.spark.sql.hudi.command.payload.ExpressionPayload.org$apache$spark$sql$hudi$command$payload$ExpressionPayload$$evaluate(ExpressionPayload.scala:258)
>  ... 18 more{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] pengzhiwei2018 commented on pull request #3415: [HUDI-2279]Support column name matching for insert * and update set *

2021-08-12 Thread GitBox


pengzhiwei2018 commented on pull request #3415:
URL: https://github.com/apache/hudi/pull/3415#issuecomment-898215349


   LGTM, Thanks for the contribution @dongkelun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1897) Implement DeltaStreamer Source for AWS S3

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398453#comment-17398453
 ] 

ASF GitHub Bot commented on HUDI-1897:
--

codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688264818



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to 
process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default 
selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties 
props) {
+String sourceSelectorClass =
+props.getString(
+CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+CloudObjectsMetaSelector.class.getName());
+try {
+  CloudObjectsMetaSelector selector =
+  (CloudObjectsMetaSelector)
+  ReflectionUtils.loadClass(
+  sourceSelectorClass, new Class[] {TypedProperties.class}, 
props);
+
+  log.info("Using path selector " + selector.getClass().getName());
+  return selector;
+} catch (Exception e) {
+  throw new HoodieException("Could not load source selector class " + 
sourceSelectorClass, e);
+}
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It 
will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List> getEligibleEvents(

Review comment:
   Makes sense. Going with validEvents as records sounds too generic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement DeltaStreamer Source for AWS S3
> -
>
> Key: HUDI-1897
> URL: https://issues.apache.org/jira/browse/HUDI-1897
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
>
> Consider
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html]
> and 
> https://docs.databricks.com/spark/latest/structured-streaming/sqs.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

2021-08-12 Thread GitBox


codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688264818



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to 
process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default 
selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties 
props) {
+String sourceSelectorClass =
+props.getString(
+CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+CloudObjectsMetaSelector.class.getName());
+try {
+  CloudObjectsMetaSelector selector =
+  (CloudObjectsMetaSelector)
+  ReflectionUtils.loadClass(
+  sourceSelectorClass, new Class[] {TypedProperties.class}, 
props);
+
+  log.info("Using path selector " + selector.getClass().getName());
+  return selector;
+} catch (Exception e) {
+  throw new HoodieException("Could not load source selector class " + 
sourceSelectorClass, e);
+}
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It 
will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List> getEligibleEvents(

Review comment:
   Makes sense. Going with validEvents as records sounds too generic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2259) [SQL]Support referencing subquery with column aliases by table alias in merge into

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398452#comment-17398452
 ] 

ASF GitHub Bot commented on HUDI-2259:
--

dongkelun commented on pull request #3380:
URL: https://github.com/apache/hudi/pull/3380#issuecomment-898210263


   > > @pengzhiwei2018 Hi, when I test Spark 3, I find that Spark SQL for Hoodie 
with Spark 3 uses Spark's own source code, but column aliases in MERGE INTO are 
not supported in Spark 3; it throws the following exception: 'Columns aliases 
are not allowed in MERGE.'. I think there are two solutions: one is to modify 
the Spark 3 source code to add support, the other is to write code in 
hudi-spark3 to implement Spark SQL for Hoodie, but I personally feel that is a 
big change. I do not know if I understand correctly, so I was hoping you could 
help with some advice.
   > > ` // org.apache.spark.sql.catalyst.parser.AstBuilder
   > > val sourceTableAlias = getTableAliasWithoutColumnAlias(ctx.sourceAlias, 
"MERGE")
   > > private def getTableAliasWithoutColumnAlias(
   > > ctx: TableAliasContext, op: String): Option[String] = {
   > > if (ctx == null) {
   > > None
   > > } else {
   > > val ident = ctx.strictIdentifier()
   > > if (ctx.identifierList() != null) {
   > > throw new ParseException(s"Columns aliases are not allowed in $op.", 
ctx.identifierList())
   > > }
   > > if (ident != null) Some(ident.getText) else None
   > > }
   > > }`
   > 
   > I think we can support this feature only for spark2 currently. You can 
branch the test case on `HoodieSqlUtils#isSpark3`: if it is spark3, use 
`checkException` to validate the exception for spark3; for spark2, check the 
answer.
   
   Ok, I'll change the test case as you suggested
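
   A minimal sketch of that branching (hedged: `HoodieSqlUtils#isSpark3` and 
`checkException` are the helpers named in the thread; `checkAnswer`, the exact 
signatures, and the SQL below are assumptions for illustration):

   // Sketch only; the real helpers live in Hudi's Spark SQL test base and may differ.
   val mergeSql =
     s"""
        |merge into $tableName as t0
        |using ( select 1, 'a1', 12, 1003 ) s0 (id, name, price, ts)
        |on s0.id = t0.id
        |when not matched then insert *
        |""".stripMargin

   if (HoodieSqlUtils.isSpark3) {
     // Spark 3's parser rejects column aliases in MERGE, so only assert the error message.
     checkException(mergeSql)("Columns aliases are not allowed in MERGE.")
   } else {
     spark.sql(mergeSql)
     checkAnswer(s"select id, name, price, ts from $tableName")(Seq(1, "a1", 12.0, 1003))
   }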


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [SQL]Support referencing subquery with column aliases by table alias in merge 
> into
> --
>
> Key: HUDI-2259
> URL: https://issues.apache.org/jira/browse/HUDI-2259
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
>  
>  Example:
> {code:java}
> val tableName = "test_hudi_table"
> spark.sql(
> s"""
> create table ${tableName} (
> id int,
> name string,
> price double,
> ts long
> ) using hudi
> options (
> primaryKey = 'id',
> type = 'cow'
> )
> location '/tmp/${tableName}'
> """.stripMargin)
> spark.sql(
> s"""
> merge into $tableName as t0
> using (
> select 1, 'a1', 12, 1003
> ) s0 (id,name,price,ts)
> on s0.id = t0.id
> when matched and id != 1 then update set *
> when matched and s0.id = 1 then delete
> when not matched then insert *
> """.stripMargin)
> {code}
> It will throw an exception:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> resolve 's0.id in (`s0.id` = `t0.id`), the input columns is: id#4, name#5, 
> price#6, ts#7, _hoodie_commit_time#8, _hoodie_commit_seqno#9, 
> _hoodie_record_key#10, _hoodie_partition_path#11, _hoodie_file_name#12, 
> id#13, name#14, price#15, ts#16L;Exception in thread "main" 
> org.apache.spark.sql.AnalysisException: Cannot resolve 's0.id in (`s0.id` = 
> `t0.id`), the input columns is: id#4, name#5, price#6, ts#7, 
> _hoodie_commit_time#8, _hoodie_commit_seqno#9, _hoodie_record_key#10, 
> _hoodie_partition_path#11, _hoodie_file_name#12, id#13, name#14, price#15, 
> ts#16L; at 
> org.apache.spark.sql.hudi.analysis.HoodieResolveReferences.org$apache$spark$sql$hudi$analysis$HoodieResolveReferences$$resolveExpressionFrom(HoodieAnalysis.scala:292)
>  at 
> org.apache.spark.sql.hudi.analysis.HoodieResolveReferences$$anonfun$apply$1.applyOrElse(HoodieAnalysis.scala:160)
>  at 
> org.apache.spark.sql.hudi.analysis.HoodieResolveReferences$$anonfun$apply$1.applyOrElse(HoodieAnalysis.scala:103)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOpe

[GitHub] [hudi] dongkelun commented on pull request #3380: [HUDI-2259]Support referencing subquery with column aliases by table alias in me…

2021-08-12 Thread GitBox


dongkelun commented on pull request #3380:
URL: https://github.com/apache/hudi/pull/3380#issuecomment-898210263


   > > @pengzhiwei2018 Hi, when I test Spark 3, I find that Spark SQL for Hoodie 
with Spark 3 uses Spark's own source code, but column aliases in MERGE INTO are 
not supported in Spark 3; it throws the following exception: 'Columns aliases 
are not allowed in MERGE.'. I think there are two solutions: one is to modify 
the Spark 3 source code to add support, the other is to write code in 
hudi-spark3 to implement Spark SQL for Hoodie, but I personally feel that is a 
big change. I do not know if I understand correctly, so I was hoping you could 
help with some advice.
   > > ` // org.apache.spark.sql.catalyst.parser.AstBuilder
   > > val sourceTableAlias = getTableAliasWithoutColumnAlias(ctx.sourceAlias, 
"MERGE")
   > > private def getTableAliasWithoutColumnAlias(
   > > ctx: TableAliasContext, op: String): Option[String] = {
   > > if (ctx == null) {
   > > None
   > > } else {
   > > val ident = ctx.strictIdentifier()
   > > if (ctx.identifierList() != null) {
   > > throw new ParseException(s"Columns aliases are not allowed in $op.", 
ctx.identifierList())
   > > }
   > > if (ident != null) Some(ident.getText) else None
   > > }
   > > }`
   > 
   > I think we can support this feature only for spark2 currently. You can 
branch the test case on `HoodieSqlUtils#isSpark3`: if it is spark3, use 
`checkException` to validate the exception for spark3; for spark2, check the 
answer.
   
   Ok, I'll change the test case as you suggested


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2279) Support column name matching for insert * and update set * in merge into when sourceTable's columns contains all targetTable's columns

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398448#comment-17398448
 ] 

ASF GitHub Bot commented on HUDI-2279:
--

dongkelun commented on pull request #3415:
URL: https://github.com/apache/hudi/pull/3415#issuecomment-898207251


   > Hi @dongkelun , Thanks for the contribution for this. Overall LGTM except 
some minor optimize. And also you can run the test case in spark3 by the follow 
command:
   > 
   > > mvn clean install -DskipTests -Pspark3
   > > mvn test -Punit-tests -Pspark3 -pl hudi-spark-datasource/hudi-spark
   
   Hi, @pengzhiwei2018 The result is: 'Tests: succeeded 56, failed 6, canceled 
0, ignored 0, pending 0'. Two of them are ORC exceptions, three I think are due 
to time zone differences (which I don't know how to resolve), and the remaining 
one is a mismatch in the exception message. The detailed results are as follows:
   
   `
   1、Test Different Type of Partition Column *** FAILED ***
 Expected Array([1,a1,10,2021-05-20 00:00:00], [2,a2,10,2021-05-20 
00:00:00]), but got Array([1,a1,10.0,2021-05-20 15:00:00], 
[2,a2,10.0,2021-05-20 15:00:00])
   2、- Test MergeInto Exception *** FAILED ***
 Expected "... for target field: '[id]' in merge into upda...", but got 
"... for target field: '[_ts]' in merge into upda..." 
(TestHoodieSqlBase.scala:86)
   3、test basic HoodieSparkSqlWriter functionality with datasource insert for 
COPY_ON_WRITE with ORC as the base file format  with populate meta fields true 
*** FAILED ***
   4、test basic HoodieSparkSqlWriter functionality with datasource insert for 
MERGE_ON_READ with ORC as the base file format  with populate meta fields true 
*** FAILED ***
   5、Test Sql Statements *** FAILED ***
 java.lang.IllegalArgumentException: UnExpect result for: select id, name, 
price, cast(dt as string) from h0_p
   Expect:
1 a1 10 2021-05-07 00:00:00, Actual:
1 a1 10 2021-05-07 15:00:00
   6、Test Create Table As Select *** FAILED ***
 Expected Array([1,a1,10,2021-05-06 00:00:00]), but got 
Array([1,a1,10,2021-05-06 15:00:00]) (TestHoodieSqlBase.scala:78)  
   `
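
   One possible angle on the time zone failures (an assumption, not something 
confirmed in this thread): the 15:00:00 vs 00:00:00 gaps above look like a 
session time zone offset, and a common way to pin Spark SQL tests is to fix both 
the JVM default and the Spark session time zone before the suite runs, assuming 
a SparkSession named `spark`:

   // Assumption: run once before the failing suites; adapt to Hudi's test harness.
   import java.util.TimeZone
   TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
   spark.conf.set("spark.sql.session.timeZone", "UTC")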


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support column name matching for insert * and update set *  in merge into 
> when sourceTable's columns contains all targetTable's columns
> ---
>
> Key: HUDI-2279
> URL: https://issues.apache.org/jira/browse/HUDI-2279
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Example:
> {code:java}
> val tableName = generateTableName
> // Create table
> spark.sql(
>  s"""
>  |create table $tableName (
>  | id int,
>  | name string,
>  | price double,
>  | ts long,
>  | dt string
>  |) using hudi
>  | location '${tmp.getCanonicalPath}/$tableName'
>  | options (
>  | primaryKey ='id',
>  | preCombineField = 'ts'
>  | )
>  """.stripMargin)
> spark.sql(
>   s"""
>  |merge into $tableName as t0
>  |using (
>  |  select 1 as id, '2021-05-05' as dt, 1002 as ts, 97 as price, 'a1' as 
> name union all
>  |  select 1 as id, '2021-05-05' as dt, 1003 as ts, 98 as price, 'a2' as 
> name union all
>  |  select 2 as id, '2021-05-05' as dt, 1001 as ts, 99 as price, 'a3' as 
> name
>  | ) as s0
>  |on t0.id = s0.id
>  |when matched then update set *
>  |when not matched  then insert *
>  |""".stripMargin)
> spark.sql(s"select id, name, price, ts, dt from $tableName").show(){code}
> For now, the result is:
> +---+----------+-----+---+---+
> | id|      name|price| ts| dt|
> +---+----------+-----+---+---+
> |  2|2021-05-05| 99.0| 99| a3|
> |  1|2021-05-05| 98.0| 98| a2|
> +---+----------+-----+---+---+
> When the column order of sourceTable is different from that of targetTable:
>  
> {code:java}
> spark.sql(
>   s"""
>  |merge into ${tableName} as t0
>  |using (
>  |  select 1 as id, 'a1' as name, 1002 as ts, '2021-05-05' as dt, 97 as 
> price union all
>  |  select 1 as id, 'a2' as name, 1003 as ts, '2021-05-05' as dt, 98 as 
> price union all
>  |  select 2 as id, 'a3' as name, 1001 as ts, '2021-05-05' as dt, 99 as 
> price
>  | ) as s0
>  |on t0.id = s0.id
>  |when matched then update set *
>  |when not matched  then insert *
>  |""".

[GitHub] [hudi] dongkelun commented on pull request #3415: [HUDI-2279]Support column name matching for insert * and update set *

2021-08-12 Thread GitBox


dongkelun commented on pull request #3415:
URL: https://github.com/apache/hudi/pull/3415#issuecomment-898207251


   > Hi @dongkelun , Thanks for the contribution for this. Overall LGTM except 
some minor optimize. And also you can run the test case in spark3 by the follow 
command:
   > 
   > > mvn clean install -DskipTests -Pspark3
   > > mvn test -Punit-tests -Pspark3 -pl hudi-spark-datasource/hudi-spark
   
   Hi, @pengzhiwei2018 The result is: 'Tests: succeeded 56, failed 6, canceled 
0, ignored 0, pending 0'. Two of them are ORC exceptions, three I think are due 
to time zone differences (which I don't know how to resolve), and the remaining 
one is a mismatch in the exception message. The detailed results are as follows:
   
   `
   1、Test Different Type of Partition Column *** FAILED ***
 Expected Array([1,a1,10,2021-05-20 00:00:00], [2,a2,10,2021-05-20 
00:00:00]), but got Array([1,a1,10.0,2021-05-20 15:00:00], 
[2,a2,10.0,2021-05-20 15:00:00])
   2、- Test MergeInto Exception *** FAILED ***
 Expected "... for target field: '[id]' in merge into upda...", but got 
"... for target field: '[_ts]' in merge into upda..." 
(TestHoodieSqlBase.scala:86)
   3、test basic HoodieSparkSqlWriter functionality with datasource insert for 
COPY_ON_WRITE with ORC as the base file format  with populate meta fields true 
*** FAILED ***
   4、test basic HoodieSparkSqlWriter functionality with datasource insert for 
MERGE_ON_READ with ORC as the base file format  with populate meta fields true 
*** FAILED ***
   5、Test Sql Statements *** FAILED ***
 java.lang.IllegalArgumentException: UnExpect result for: select id, name, 
price, cast(dt as string) from h0_p
   Expect:
1 a1 10 2021-05-07 00:00:00, Actual:
1 a1 10 2021-05-07 15:00:00
   6、Test Create Table As Select *** FAILED ***
 Expected Array([1,a1,10,2021-05-06 00:00:00]), but got 
Array([1,a1,10,2021-05-06 15:00:00]) (TestHoodieSqlBase.scala:78)  
   `


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2303) TestMereIntoLogOnlyTable with metadata enabled surfaces likely bug

2021-08-12 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398447#comment-17398447
 ] 

Prashant Wason commented on HUDI-2303:
--

So this is an issue with HoodieMetadataFileSystemView not overriding sync(), 
where it should update the metadata reader to reflect the new state of the 
dataset. The MetadataReader opened by the TimelineServer is therefore not 
refreshed correctly, so it returns the older listing (not containing the latest 
log file) and the compaction misses the latest log block.

This patch is already covered in [https://github.com/apache/hudi/pull/3210], 
which is about to be committed, so I won't be raising a new PR for this fix.

Let's verify this once 3210 is merged.

> TestMereIntoLogOnlyTable with metadata enabled surfaces likely bug
> --
>
> Key: HUDI-2303
> URL: https://issues.apache.org/jira/browse/HUDI-2303
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Udit Mehrotra
>Assignee: Prashant Wason
>Priority: Major
>
> While enabling Metadata as part of 
> [https://github.com/apache/hudi/pull/3411/] one of the test that fails is 
> *TestMereIntoLogOnlyTable*.
> Upon looking a bit, what I found is after the final *Merge* command there is 
> an inline compaction that is triggered. The parquet file formed as part of 
> the compaction misses out on the data from the latest log file right before 
> compaction.
> I think it might be because of metadata returning an incorrect list for 
> compaction, missing out on the latest log file.
> cc [~pwason] [~vinoth]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2303) TestMereIntoLogOnlyTable with metadata enabled surfaces likely bug

2021-08-12 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398445#comment-17398445
 ] 

Prashant Wason commented on HUDI-2303:
--

The following diff fixes this issue:

 

diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java b/hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
index e408ad939..f365ed0a7 100644
--- a/hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
+++ b/hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
@@ -348,4 +348,8 @@ public abstract class BaseTableMetadata implements HoodieTableMetadata {
     return datasetMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant()
         .map(HoodieInstant::getTimestamp).orElse(SOLO_COMMIT_TIMESTAMP);
   }
+
+  public HoodieMetadataConfig getMetadataConfig() {
+    return metadataConfig;
+  }
 }
diff --git a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataFileSystemView.java b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataFileSystemView.java
index a3d0e2dfe..7b0d5daef 100644
--- a/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataFileSystemView.java
+++ b/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataFileSystemView.java
@@ -36,7 +36,7 @@ import org.apache.hudi.exception.HoodieException;
  */
 public class HoodieMetadataFileSystemView extends HoodieTableFileSystemView {
 
-  private final HoodieTableMetadata tableMetadata;
+  private HoodieTableMetadata tableMetadata;
 
   public HoodieMetadataFileSystemView(HoodieTableMetaClient metaClient,
       HoodieTimeline visibleActiveTimeline,
@@ -73,4 +73,16 @@ public class HoodieMetadataFileSystemView extends HoodieTableFileSystemView {
       throw new HoodieException("Error closing metadata file system view.", e);
     }
   }
+
+  @Override
+  public void sync() {
+    // Sync the tableMetadata first as super.sync() may call listPartition
+    if (tableMetadata != null) {
+      BaseTableMetadata baseMetadata = (BaseTableMetadata) tableMetadata;
+      tableMetadata = HoodieTableMetadata.create(baseMetadata.getEngineContext(), baseMetadata.getMetadataConfig(),
+          metaClient.getBasePath(), FileSystemViewStorageConfig.FILESYSTEM_VIEW_SPILLABLE_DIR.defaultValue());
+    }
+
+    super.sync();
+  }
 }

> TestMereIntoLogOnlyTable with metadata enabled surfaces likely bug
> --
>
> Key: HUDI-2303
> URL: https://issues.apache.org/jira/browse/HUDI-2303
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Udit Mehrotra
>Assignee: Prashant Wason
>Priority: Major
>
> While enabling Metadata as part of 
> [https://github.com/apache/hudi/pull/3411/] one of the test that fails is 
> *TestMereIntoLogOnlyTable*.
> Upon looking a bit, what I found is after the final *Merge* command there is 
> an inline compaction that is triggered. The parquet file formed as part of 
> the compaction misses out on the data from the latest log file right before 
> compaction.
> I think it might be because of metadata returning an incorrect list for 
> compaction, missing out on the latest log file.
> cc [~pwason] [~vinoth]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on issue #2919: [SUPPORT] Schema Evolution Failing Spark+Hudi- Adding New fields

2021-08-12 Thread GitBox


nsivabalan commented on issue #2919:
URL: https://github.com/apache/hudi/issues/2919#issuecomment-898181212


   @tandonraghav : we had a [fix](https://github.com/apache/hudi/pull/3137) 
sometime back wrt schema evolution and latest master should work. Can you try 
out w/ latest master and let us know how it goes. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2303) TestMereIntoLogOnlyTable with metadata enabled surfaces likely bug

2021-08-12 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398431#comment-17398431
 ] 

Prashant Wason commented on HUDI-2303:
--

Confirmed: When the dataset is being compacted, the latest base file is being 
missed. Hence, the test fails as the last committed data is missing. Checked 
that there is no sync failure and table is consistent. The very next lookup 
from the table returns all data. So this is very specific to the inline 
compaction workflow.

I suspect it is to do with the TimelineServer or using an older 
HoodieBackedTableMetadata instance which reads the older data. 

> TestMereIntoLogOnlyTable with metadata enabled surfaces likely bug
> --
>
> Key: HUDI-2303
> URL: https://issues.apache.org/jira/browse/HUDI-2303
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Udit Mehrotra
>Assignee: Prashant Wason
>Priority: Major
>
> While enabling Metadata as part of 
> [https://github.com/apache/hudi/pull/3411/] one of the test that fails is 
> *TestMereIntoLogOnlyTable*.
> Upon looking a bit, what I found is after the final *Merge* command there is 
> an inline compaction that is triggered. The parquet file formed as part of 
> the compaction misses out on the data from the latest log file right before 
> compaction.
> I think it might be because of metadata returning an incorrect list for 
> compaction, missing out on the latest log file.
> cc [~pwason] [~vinoth]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2119) Syncing of rollbacks to metadata table does not work in all cases

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398430#comment-17398430
 ] 

ASF GitHub Bot commented on HUDI-2119:
--

hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * 96c73ba8ed99f51f31efdac0ab60d6b95bc782ed Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1699)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Syncing of rollbacks to metadata table does not work in all cases
> -
>
> Key: HUDI-2119
> URL: https://issues.apache.org/jira/browse/HUDI-2119
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> This is an issue with inline automatic rollbacks.
> Metadata table assumes that a rollbacks is to be applied if the 
> instant-being-rolled back has a timestamp less than the last deltacommit time 
> on the metadata timeline. We do not explicitly check if the 
> instant-being-rolled-back was actually written to metadata table.
> **A rollback adds a record to metadata table which "deletes" files from a 
> failed/earlier commit. If the files being deleted were never actually 
> committed to metadata table earlier, the deletes cannot be consolidated 
> during metadata table reads. This leads to a HoodieMetadataException as we 
> cannot differentiate this from a bug where we might have missed committing a 
> commit to metadata table.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3210: [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant.

2021-08-12 Thread GitBox


hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * 96c73ba8ed99f51f31efdac0ab60d6b95bc782ed Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1699)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-2305) Fix marker-based rollback in 0.9.0

2021-08-12 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-2305:
---

Assignee: Ethan Guo

> Fix marker-based rollback in 0.9.0
> --
>
> Key: HUDI-2305
> URL: https://issues.apache.org/jira/browse/HUDI-2305
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2305) Fix marker-based rollback in 0.9.0

2021-08-12 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-2305:
---

 Summary: Fix marker-based rollback in 0.9.0
 Key: HUDI-2305
 URL: https://issues.apache.org/jira/browse/HUDI-2305
 Project: Apache Hudi
  Issue Type: Bug
  Components: Writer Core
Reporter: Ethan Guo
 Fix For: 0.9.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2304) Flip some config options for Flink

2021-08-12 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-2304:
-
Description: 
{{index.state.ttl}} default value changes from 1.5D to 0D, which means storing 
the index forever.

{{write.insert.drop.duplicates}} changes to true when table type is COW.

> Flip some config options for Flink
> --
>
> Key: HUDI-2304
> URL: https://issues.apache.org/jira/browse/HUDI-2304
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.9.0
>
>
> {{index.state.ttl}} default value changes from 1.5D to 0D, which means 
> storing the index forever.
> {{write.insert.drop.duplicates}} changes to true when table type is COW.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-2304) Flip some config options for Flink

2021-08-12 Thread Danny Chen (Jira)
Danny Chen created HUDI-2304:


 Summary: Flip some config options for Flink
 Key: HUDI-2304
 URL: https://issues.apache.org/jira/browse/HUDI-2304
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Flink Integration
Reporter: Danny Chen
 Fix For: 0.9.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1897) Implement DeltaStreamer Source for AWS S3

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398422#comment-17398422
 ] 

ASF GitHub Bot commented on HUDI-1897:
--

vinothchandar commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688220583



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud files meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {

Review comment:
   again, does the schema work in general for any cloud store? if not, we 
can call this just S3EventsHoodieIncrSource or sth

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsDfsSource.java
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsDfsSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides the capability to create the hoodie table from cloud object data (e.g. S3 events).
+ * It will primarily use the cloud queue to fetch new object information and update the hoodie table with
+ * cloud object data.
+ */
+public class CloudObjectsDfsSource extends RowSource {

Review comment:
   lets call this `S3EventSource` or `S3ActivitySource`? something specific 
to S3? It does not work with cloud object stores in general or sth right

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information

[GitHub] [hudi] vinothchandar commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

2021-08-12 Thread GitBox


vinothchandar commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688220583



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud files meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {

Review comment:
   again, does the schema work in general for any cloud store? if not, we 
can call this just S3EventsHoodieIncrSource or sth

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsDfsSource.java
##
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsDfsSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides the capability to create the hoodie table from cloud object data (e.g. S3 events).
+ * It will primarily use the cloud queue to fetch new object information and update the hoodie table with
+ * cloud object data.
+ */
+public class CloudObjectsDfsSource extends RowSource {

Review comment:
   lets call this `S3EventSource` or `S3ActivitySource`? something specific 
to S3? It does not work with cloud object stores in general or sth right

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  

[GitHub] [hudi] vinothchandar merged pull request #3464: [MINOR] Deprecate older configs

2021-08-12 Thread GitBox


vinothchandar merged pull request #3464:
URL: https://github.com/apache/hudi/pull/3464


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2119) Syncing of rollbacks to metadata table does not work in all cases

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398418#comment-17398418
 ] 

ASF GitHub Bot commented on HUDI-2119:
--

hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1698)
 
   * 96c73ba8ed99f51f31efdac0ab60d6b95bc782ed Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1699)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Syncing of rollbacks to metadata table does not work in all cases
> -
>
> Key: HUDI-2119
> URL: https://issues.apache.org/jira/browse/HUDI-2119
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> This is an issue with inline automatic rollbacks.
> Metadata table assumes that a rollback is to be applied if the 
> instant-being-rolled-back has a timestamp less than the last deltacommit time 
> on the metadata timeline. We do not explicitly check whether the 
> instant-being-rolled-back was actually written to the metadata table.
> A rollback adds a record to the metadata table which "deletes" files from a 
> failed/earlier commit. If the files being deleted were never actually 
> committed to the metadata table earlier, the deletes cannot be consolidated 
> during metadata table reads. This leads to a HoodieMetadataException, as we 
> cannot differentiate this from a bug where we might have missed committing a 
> commit to the metadata table.
>  
>  
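For context, an illustrative sketch of the missing guard described above (the class and helper shape here are made up and may differ from the actual change in PR #3210): a rollback should only be synced to the metadata table when the instant being rolled back was itself committed there earlier.

{code:java}
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

public class RollbackSyncGuardSketch {
  public static boolean shouldApplyRollback(HoodieTimeline metadataTimeline,
                                            String instantBeingRolledBack,
                                            String lastDeltaCommitOnMetadata) {
    // Existing condition: the instant is older than the last deltacommit on the
    // metadata timeline.
    boolean olderThanLastSync = HoodieTimeline.compareTimestamps(
        instantBeingRolledBack, HoodieTimeline.LESSER_THAN, lastDeltaCommitOnMetadata);
    // Missing condition: the instant was actually committed to the metadata
    // table; otherwise its files were never added there and the rollback's
    // "delete" records cannot be consolidated on read.
    boolean wasSynced = metadataTimeline.containsInstant(
        new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, instantBeingRolledBack));
    return olderThanLastSync && wasSynced;
  }
}
{code}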



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3210: [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant.

2021-08-12 Thread GitBox


hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1698)
 
   * 96c73ba8ed99f51f31efdac0ab60d6b95bc782ed Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1699)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2119) Syncing of rollbacks to metadata table does not work in all cases

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398417#comment-17398417
 ] 

ASF GitHub Bot commented on HUDI-2119:
--

hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1698)
 
   * 96c73ba8ed99f51f31efdac0ab60d6b95bc782ed UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Syncing of rollbacks to metadata table does not work in all cases
> -
>
> Key: HUDI-2119
> URL: https://issues.apache.org/jira/browse/HUDI-2119
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> This is an issue with inline automatic rollbacks.
> Metadata table assumes that a rollback is to be applied if the 
> instant-being-rolled-back has a timestamp less than the last deltacommit time 
> on the metadata timeline. We do not explicitly check whether the 
> instant-being-rolled-back was actually written to the metadata table.
> A rollback adds a record to the metadata table which "deletes" files from a 
> failed/earlier commit. If the files being deleted were never actually 
> committed to the metadata table earlier, the deletes cannot be consolidated 
> during metadata table reads. This leads to a HoodieMetadataException, as we 
> cannot differentiate this from a bug where we might have missed committing a 
> commit to the metadata table.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3210: [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant.

2021-08-12 Thread GitBox


hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1698)
 
   * 96c73ba8ed99f51f31efdac0ab60d6b95bc782ed UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1897) Implement DeltaStreamer Source for AWS S3

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398408#comment-17398408
 ] 

ASF GitHub Bot commented on HUDI-1897:
--

codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688207653



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides capability to create the hoodie table for cloudObject 
Metadata (eg. s3

Review comment:
   "hoodie cloud meta table" sounds like hoodie as cloud provider (or 
provisioned by hoodie). Instead, "hoodie table for cloud object metadata" 
sounds more clear. Wdyt?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement DeltaStreamer Source for AWS S3
> -
>
> Key: HUDI-1897
> URL: https://issues.apache.org/jira/browse/HUDI-1897
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
>
> Consider
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html]
> and 
> https://docs.databricks.com/spark/latest/structured-streaming/sqs.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

2021-08-12 Thread GitBox


codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688207653



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides capability to create the hoodie table for cloudObject 
Metadata (eg. s3

Review comment:
   "hoodie cloud meta table" sounds like hoodie as cloud provider (or 
provisioned by hoodie). Instead, "hoodie table for cloud object metadata" 
sounds more clear. Wdyt?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1897) Implement DeltaStreamer Source for AWS S3

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398407#comment-17398407
 ] 

ASF GitHub Bot commented on HUDI-1897:
--

codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688206771



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud 
objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+  Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = 
LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+DataSourceUtils.checkRequiredProperties(props, 
Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+this.props = props;
+this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+this.regionName = props.getString(Config.QUEUE_REGION);
+this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, 
"s3").toLowerCase();
+this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+this.maxMessageEachBatch = 
props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+this.visibilityTimeout = 
props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+GetQueueAttributesResult queueAttributesResult =
+sqsClient.getQueueAttributes(
+new GetQueueAttributesRequest(queueUrl)
+.withAttributeNames("ApproximateNumberOfMessages"));
+return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject 
record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+  throws UnsupportedEncodingException {
+
+Map<String, Object> fileRecord = new HashMap<>();
+String eventTimeStr = 

[jira] [Commented] (HUDI-1897) Implement DeltaStreamer Source for AWS S3

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398406#comment-17398406
 ] 

ASF GitHub Bot commented on HUDI-1897:
--

codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688206707



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to 
process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default 
selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties 
props) {
+String sourceSelectorClass =
+props.getString(
+CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+CloudObjectsMetaSelector.class.getName());
+try {
+  CloudObjectsMetaSelector selector =
+  (CloudObjectsMetaSelector)
+  ReflectionUtils.loadClass(
+  sourceSelectorClass, new Class[] {TypedProperties.class}, 
props);
+
+  log.info("Using path selector " + selector.getClass().getName());
+  return selector;
+} catch (Exception e) {
+  throw new HoodieException("Could not load source selector class " + 
sourceSelectorClass, e);
+}
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+  AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+List<Message> ineligibleMessages = new ArrayList<>();
+
+ReceiveMessageRequest receiveMessageRequest =
+new ReceiveMessageRequest()
+.withQueueUrl(this.queueUrl)
+.withWaitTimeSeconds(this.longPollWait)
+.withVisibilityTimeout(this.visibilityTimeout);
+receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+List<Message> messages =
+getMessagesToProcess(
+sqs,
+this.queueUrl,
+receiveMessageRequest,
+this.maxMessageEachBatch,
+this.maxMessagesEachRequest);
+
+for (Message message : messages) {
+  boolean isMessageDelete = Boolean.TRUE;
+
+  JSONObject messageBody = new JSONObject(message.getBody());
+  Map<String, Object> messageMap;
+  ObjectMapper mapper = new ObjectMapper();
+
+  if (messageBody.has("Message")) {
+// If this message is from S3Event -> SNS -> SQS
+messageMap =
+(Map<String, Object>) mapper.readValue(messageBody.getString(

[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

2021-08-12 Thread GitBox


codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688206771



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud 
objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+  Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = 
LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+DataSourceUtils.checkRequiredProperties(props, 
Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+this.props = props;
+this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+this.regionName = props.getString(Config.QUEUE_REGION);
+this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, 
"s3").toLowerCase();
+this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+this.maxMessageEachBatch = 
props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+this.visibilityTimeout = 
props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+GetQueueAttributesResult queueAttributesResult =
+sqsClient.getQueueAttributes(
+new GetQueueAttributesRequest(queueUrl)
+.withAttributeNames("ApproximateNumberOfMessages"));
+return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject 
record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+  throws UnsupportedEncodingException {
+
+Map<String, Object> fileRecord = new HashMap<>();
+String eventTimeStr = record.getString("eventTime");
+long eventTime =
+
Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+String bucket =
+

[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

2021-08-12 Thread GitBox


codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688206707



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to 
process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default 
selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties 
props) {
+String sourceSelectorClass =
+props.getString(
+CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+CloudObjectsMetaSelector.class.getName());
+try {
+  CloudObjectsMetaSelector selector =
+  (CloudObjectsMetaSelector)
+  ReflectionUtils.loadClass(
+  sourceSelectorClass, new Class[] {TypedProperties.class}, 
props);
+
+  log.info("Using path selector " + selector.getClass().getName());
+  return selector;
+} catch (Exception e) {
+  throw new HoodieException("Could not load source selector class " + 
sourceSelectorClass, e);
+}
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+  AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+List<Message> ineligibleMessages = new ArrayList<>();
+
+ReceiveMessageRequest receiveMessageRequest =
+new ReceiveMessageRequest()
+.withQueueUrl(this.queueUrl)
+.withWaitTimeSeconds(this.longPollWait)
+.withVisibilityTimeout(this.visibilityTimeout);
+receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+List<Message> messages =
+getMessagesToProcess(
+sqs,
+this.queueUrl,
+receiveMessageRequest,
+this.maxMessageEachBatch,
+this.maxMessagesEachRequest);
+
+for (Message message : messages) {
+  boolean isMessageDelete = Boolean.TRUE;
+
+  JSONObject messageBody = new JSONObject(message.getBody());
+  Map<String, Object> messageMap;
+  ObjectMapper mapper = new ObjectMapper();
+
+  if (messageBody.has("Message")) {
+// If this message is from S3Event -> SNS -> SQS
+messageMap =
+(Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+  } else {
+// If this message is from S3Event -> SQS
+messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+  }
+  if (messageMap.containsKey("Records")) {
+List<Map<String, Object>> record

[jira] [Assigned] (HUDI-2299) The log format DELETE block lose the info orderingVal

2021-08-12 Thread Zheng yunhong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng yunhong reassigned HUDI-2299:
---

Assignee: (was: Zheng yunhong)

> The log format DELETE block lose the info orderingVal
> -
>
> Key: HUDI-2299
> URL: https://issues.apache.org/jira/browse/HUDI-2299
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.10.0
>
>
> The append handle now always writes the data block first and then the delete 
> block, and the delete block only keeps the hoodie keys. When reading, the 
> scanner just reads the DELETE block without any info about the ordering value. 
> Thus, if we write two records:
> insert: {id: 0, ts: 2}
> delete: {id: 0, ts: 1}
> the insert message ends up deleted! This is a critical bug for streaming 
> writes; we should fix it as soon as possible.
> _*Here is the discussion on slack*_:
> Danny Chan  12:42 PM
> https://issues.apache.org/jira/browse/HUDI-2299
> 12:43
> Hi, @vc, our user found a critical bug for MOR log format, if there are 
> disorder DELETEs in the streaming messages, the event time of the DELETEs are 
> totally ignored.
> 12:44
> I guess this should be a blocker of 0.9 because it affect the correctness of 
> the data set.
> vc  12:44 PM
> if we can fix it by end of day friday PST
> 12:44
> we can add it
> 12:44
> Just want to cut a release this week.
> 12:45
> Do you have a sense for the fix? bandwidth to take it up?
> Danny Chan  12:46 PM
> I try to fix it but can not figure out a good way, if the DELETE block 
> records the orderingVal, the format breaks the compatibility.
> vc  1:05 PM
> We can version the format. thats doable. Should we precombine before even 
> logging the deletes?
> Danny Chan  1:11 PM
> Yes, we should
> vc  1:26 PM
> I think, thats how its working today. Deletes don't have an ordering val per 
> se, right
> 1:28
> Delete block at t1 :
>   delete key k
> Data block at t2 :
>   ins key k with ordering val 2
> We can just fix it so that the insert shows up, since t2 > t1.
> For what kind of functionality you need, we need to do soft deletes i.e 
> updates with an ordering value instead of hard deletes
> 1:28
> makes sense?
> Danny Chan  1:32 PM
> we can but that’s not the perfect solution, especially if the dataset comes 
> from a CDC source, for example the MySQL binlog. There is no extra flag in 
> schema for soft delete though.
> 1:37
> In my opinion, it is not about soft DELETE or hard DELETE, even if we do a 
> soft DELETE, the event time (orderingVal) is still important for consumers 
> for versioning. (edited) 
> vc  1:57 PM
> tbh, I don't see us fixing this in two days
> 1:58
> lets do a 0.9.1 after this ?
> 1:58
> shortly after with a bunch of bug fixes and the large pending PRs
> 1:58
> we can even make it 0.10.0
> Danny Chan  1:58 PM
> Yes, the cut time is very soon. We can move the fix to next version.
> vc  1:59 PM
> We have some inconsistent semantics in places
> 1:59
> some are commit time (arrival time) based and some are orderingVal (event 
> time) based
> 2:00
> In the meantime, see HoodieDeleteBlockVersion you can just define a new 
> version for delete block alone for e,g
> 2:00
> and add more information



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2299) The log format DELETE block lose the info orderingVal

2021-08-12 Thread Zheng yunhong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng yunhong reassigned HUDI-2299:
---

Assignee: Zheng yunhong

> The log format DELETE block lose the info orderingVal
> -
>
> Key: HUDI-2299
> URL: https://issues.apache.org/jira/browse/HUDI-2299
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Danny Chen
>Assignee: Zheng yunhong
>Priority: Major
> Fix For: 0.10.0
>
>
> The append handle now always writes the data block first and then the delete 
> block, and the delete block only keeps the hoodie keys. When reading, the 
> scanner just reads the DELETE block without any info about the ordering value. 
> Thus, if we write two records:
> insert: {id: 0, ts: 2}
> delete: {id: 0, ts: 1}
> the insert message ends up deleted! This is a critical bug for streaming 
> writes; we should fix it as soon as possible.
> _*Here is the discussion on slack*_:
> Danny Chan  12:42 PM
> https://issues.apache.org/jira/browse/HUDI-2299
> 12:43
> Hi, @vc, our user found a critical bug for MOR log format, if there are 
> disorder DELETEs in the streaming messages, the event time of the DELETEs are 
> totally ignored.
> 12:44
> I guess this should be a blocker of 0.9 because it affect the correctness of 
> the data set.
> vc  12:44 PM
> if we can fix it by end of day friday PST
> 12:44
> we can add it
> 12:44
> Just want to cut a release this week.
> 12:45
> Do you have a sense for the fix? bandwidth to take it up?
> Danny Chan  12:46 PM
> I try to fix it but can not figure out a good way, if the DELETE block 
> records the orderingVal, the format breaks the compatibility.
> vc  1:05 PM
> We can version the format. thats doable. Should we precombine before even 
> logging the deletes?
> Danny Chan  1:11 PM
> Yes, we should
> vc  1:26 PM
> I think, thats how its working today. Deletes don't have an ordering val per 
> se, right
> 1:28
> Delete block at t1 :
>   delete key k
> Data block at t2 :
>   ins key k with ordering val 2
> We can just fix it so that the insert shows up, since t2 > t1.
> For what kind of functionality you need, we need to do soft deletes i.e 
> updates with an ordering value instead of hard deletes
> 1:28
> makes sense?
> Danny Chan  1:32 PM
> we can but that’s not the perfect solution, especially if the dataset comes 
> from a CDC source, for example the MySQL binlog. There is no extra flag in 
> schema for soft delete though.
> 1:37
> In my opinion, it is not about soft DELETE or hard DELETE, even if we do a 
> soft DELETE, the event time (orderingVal) is still important for consumers 
> for versioning. (edited) 
> vc  1:57 PM
> tbh, I don't see us fixing this in two days
> 1:58
> lets do a 0.9.1 after this ?
> 1:58
> shortly after with a bunch of bug fixes and the large pending PRs
> 1:58
> we can even make it 0.10.0
> Danny Chan  1:58 PM
> Yes, the cut time is very soon. We can move the fix to next version.
> vc  1:59 PM
> We have some inconsistent semantics in places
> 1:59
> some are commit time (arrival time) based and some are orderingVal (event 
> time) based
> 2:00
> In the meantime, see HoodieDeleteBlockVersion you can just define a new 
> version for delete block alone for e,g
> 2:00
> and add more information



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1897) Implement DeltaStreamer Source for AWS S3

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398394#comment-17398394
 ] 

ASF GitHub Bot commented on HUDI-1897:
--

codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688202916



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to 
process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default 
selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties 
props) {

Review comment:
   It is being used in `CloudObjectsDfsSource`. Now that we don't need that 
source, its usage is limited to `CloudObjectsMetaSource` only. However, I think 
it's better to keep it as a static factory method, a) semantics in line with 
DFSPathSelector, b) could be useful in future as we add more cloud provider 
sources.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Implement DeltaStreamer Source for AWS S3
> -
>
> Key: HUDI-1897
> URL: https://issues.apache.org/jira/browse/HUDI-1897
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: pull-request-available
>
> Consider
> [https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html]
> and 
> https://docs.databricks.com/spark/latest/structured-streaming/sqs.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

2021-08-12 Thread GitBox


codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688202916



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to 
process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default 
selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties 
props) {

Review comment:
   It is being used in `CloudObjectsDfsSource`. Now that we don't need that 
source, its usage is limited to `CloudObjectsMetaSource` only. However, I think 
it's better to keep it as a static factory method, a) semantics in line with 
DFSPathSelector, b) could be useful in future as we add more cloud provider 
sources.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2259) [SQL]Support referencing subquery with column aliases by table alias in merge into

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398389#comment-17398389
 ] 

ASF GitHub Bot commented on HUDI-2259:
--

pengzhiwei2018 commented on pull request #3380:
URL: https://github.com/apache/hudi/pull/3380#issuecomment-898115748


   > @pengzhiwei2018 Hi, when I test Spark3, I find that Spark SQL for Hoodie 
with Spark3 uses the source code of Spark, but column aliases in Merge Into are 
not supported in Spark3; it will throw the following exception: 'Columns 
aliases are not allowed in MERGE.'. I think there are two solutions: one is to 
modify the source code of Spark3 to make Spark support it, the other is to write 
code in hudi-spark3 to implement Spark SQL for Hoodie, but I personally feel 
that this is a big change; I do not know if I understand correctly. So I was 
hoping you could help with some advice.
   > 
   > ` // org.apache.spark.sql.catalyst.parser.AstBuilder
   > 
   > val sourceTableAlias = getTableAliasWithoutColumnAlias(ctx.sourceAlias, 
"MERGE")
   > private def getTableAliasWithoutColumnAlias(
   > ctx: TableAliasContext, op: String): Option[String] = {
   > if (ctx == null) {
   > None
   > } else {
   > val ident = ctx.strictIdentifier()
   > if (ctx.identifierList() != null) {
   > throw new ParseException(s"Columns aliases are not allowed in $op.", 
ctx.identifierList())
   > }
   > if (ident != null) Some(ident.getText) else None
   > }
   > }`
   
   I think we can support this feature only for Spark 2 currently. You can 
branch the test case on `HoodieSqlUtils#isSpark3`: if it is Spark 3, use 
`checkException` to validate the exception; for Spark 2, check the answer.
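
   To make the suggestion concrete, the version-gated assertion could be shaped roughly like the
   sketch below. The `isSpark3`, `checkAnswer`, and `checkException` helpers here are simplified
   stand-ins whose signatures are assumptions for the example, not the exact Hudi test harness API:

   ```scala
   // Sketch of a version-gated assertion; helper signatures are assumptions.
   object MergeIntoAliasTestSketch {
     def isSpark3(sparkVersion: String): Boolean = sparkVersion.startsWith("3.")

     // Stand-ins for the test harness helpers mentioned above.
     def checkAnswer(sql: String)(expected: Seq[Any]*): Unit =
       println(s"would run [$sql] and compare the result against $expected")
     def checkException(sql: String)(expectedError: String): Unit =
       println(s"would expect [$sql] to fail with: $expectedError")

     def main(args: Array[String]): Unit = {
       val sparkVersion = args.headOption.getOrElse("2.4.4") // assumed default
       val merge =
         """merge into h0 t0
           |using (select 1, 'a1', 12, 1003) s0 (id, name, price, ts)
           |on s0.id = t0.id
           |when not matched then insert *""".stripMargin

       if (isSpark3(sparkVersion)) {
         // Spark 3's parser rejects column aliases on the MERGE source table.
         checkException(merge)("Columns aliases are not allowed in MERGE.")
       } else {
         // Spark 2: the statement is expected to succeed, so validate the written row.
         checkAnswer("select id, name, price, ts from h0")(Seq(1, "a1", 12.0, 1003))
       }
     }
   }
   ```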


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [SQL]Support referencing subquery with column aliases by table alias in merge 
> into
> --
>
> Key: HUDI-2259
> URL: https://issues.apache.org/jira/browse/HUDI-2259
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
>  
>  Example:
> {code:java}
> val tableName = "test_hudi_table"
> spark.sql(
> s"""
> create table ${tableName} (
> id int,
> name string,
> price double,
> ts long
> ) using hudi
> options (
> primaryKey = 'id',
> type = 'cow'
> )
> location '/tmp/${tableName}'
> """.stripMargin)
> spark.sql(
> s"""
> merge into $tableName as t0
> using (
> select 1, 'a1', 12, 1003
> ) s0 (id,name,price,ts)
> on s0.id = t0.id
> when matched and id != 1 then update set *
> when matched and s0.id = 1 then delete
> when not matched then insert *
> """.stripMargin)
> {code}
> It will throw an exception:
> {code:java}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot 
> resolve 's0.id in (`s0.id` = `t0.id`), the input columns is: id#4, name#5, 
> price#6, ts#7, _hoodie_commit_time#8, _hoodie_commit_seqno#9, 
> _hoodie_record_key#10, _hoodie_partition_path#11, _hoodie_file_name#12, 
> id#13, name#14, price#15, ts#16L;Exception in thread "main" 
> org.apache.spark.sql.AnalysisException: Cannot resolve 's0.id in (`s0.id` = 
> `t0.id`), the input columns is: id#4, name#5, price#6, ts#7, 
> _hoodie_commit_time#8, _hoodie_commit_seqno#9, _hoodie_record_key#10, 
> _hoodie_partition_path#11, _hoodie_file_name#12, id#13, name#14, price#15, 
> ts#16L; at 
> org.apache.spark.sql.hudi.analysis.HoodieResolveReferences.org$apache$spark$sql$hudi$analysis$HoodieResolveReferences$$resolveExpressionFrom(HoodieAnalysis.scala:292)
>  at 
> org.apache.spark.sql.hudi.analysis.HoodieResolveReferences$$anonfun$apply$1.applyOrElse(HoodieAnalysis.scala:160)
>  at 
> org.apache.spark.sql.hudi.analysis.HoodieResolveReferences$$anonfun$apply$1.applyOrElse(HoodieAnalysis.scala:103)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>  at 
> org.apache.spark.sql

[GitHub] [hudi] pengzhiwei2018 commented on pull request #3380: [HUDI-2259]Support referencing subquery with column aliases by table alias in me…

2021-08-12 Thread GitBox


pengzhiwei2018 commented on pull request #3380:
URL: https://github.com/apache/hudi/pull/3380#issuecomment-898115748


   > @pengzhiwei2018 Hi,When I test Spark3, I find that Spark SQL for Hoodie 
with Spark3 uses the source code of Spark, but columns aliases in Merge Into is 
not supported in Spark3, it will throw the following exception: 'Columns 
aliases are not allowed in MERGE.'.I think there are two solutions, one is to 
modify the source code of Spark3 to make Spark support, the other is to write 
code in hudi-spark3 to implement Spark SQL for Hoodie, but I personally feel 
that this is a big change, I do not know if I understand correctly. So I was 
hoping you could help with some advice.
   > 
   > ` // org.apache.spark.sql.catalyst.parser.AstBuilder
   > 
   > val sourceTableAlias = getTableAliasWithoutColumnAlias(ctx.sourceAlias, 
"MERGE")
   > private def getTableAliasWithoutColumnAlias(
   > ctx: TableAliasContext, op: String): Option[String] = {
   > if (ctx == null) {
   > None
   > } else {
   > val ident = ctx.strictIdentifier()
   > if (ctx.identifierList() != null) {
   > throw new ParseException(s"Columns aliases are not allowed in $op.", 
ctx.identifierList())
   > }
   > if (ident != null) Some(ident.getText) else None
   > }
   > }`
   
   I think we can support this feature only for Spark 2 currently. You can 
branch the test case on `HoodieSqlUtils#isSpark3`: if it is Spark 3, use 
`checkException` to validate the exception; for Spark 2, check the answer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1264) incremental read support with replace

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398388#comment-17398388
 ] 

ASF GitHub Bot commented on HUDI-1264:
--

lw309637554 closed pull request #2199:
URL: https://github.com/apache/hudi/pull/2199


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> incremental read support with replace
> -
>
> Key: HUDI-1264
> URL: https://issues.apache.org/jira/browse/HUDI-1264
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> In the initial version, we could fail incremental reads if there is a REPLACE 
> instant. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1264) incremental read support with replace

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398387#comment-17398387
 ] 

ASF GitHub Bot commented on HUDI-1264:
--

lw309637554 commented on pull request #2199:
URL: https://github.com/apache/hudi/pull/2199#issuecomment-898113298


   > @lw309637554 Is this PR still valid given that #3139 is merged now?
   
   @codope hello, I think I can close this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> incremental read support with replace
> -
>
> Key: HUDI-1264
> URL: https://issues.apache.org/jira/browse/HUDI-1264
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> In the initial version, we could fail incremental reads if there is a REPLACE 
> instant. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] lw309637554 closed pull request #2199: [HUDI-1264][WIP] spark incremental read support with replace

2021-08-12 Thread GitBox


lw309637554 closed pull request #2199:
URL: https://github.com/apache/hudi/pull/2199


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lw309637554 commented on pull request #2199: [HUDI-1264][WIP] spark incremental read support with replace

2021-08-12 Thread GitBox


lw309637554 commented on pull request #2199:
URL: https://github.com/apache/hudi/pull/2199#issuecomment-898113298


   > @lw309637554 Is this PR still valid given that #3139 is merged now?
   
   @codope hello, I think I can close this one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2299) The log format DELETE block lose the info orderingVal

2021-08-12 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-2299:
-
Description: 
The append handle now always writes the data block first and then the delete 
block, and the delete block only keeps the hoodie keys. When reading, the 
scanner just reads the DELETE block without any info about the ordering value. 
Thus, if we write two records:

insert: {id: 0, ts: 2}
delete: {id: 0, ts: 1}

the insert message is finally deleted! This is a critical bug for streaming 
write; we should fix it as soon as possible.

_*Here is the discussion on slack*_:

Danny Chan  12:42 PM
https://issues.apache.org/jira/browse/HUDI-2299
12:43
Hi, @vc, our user found a critical bug for MOR log format, if there are 
disorder DELETEs in the streaming messages, the event time of the DELETEs are 
totally ignored.
12:44
I guess this should be a blocker of 0.9 because it affect the correctness of 
the data set.

vc  12:44 PM
if we can fix it by end of day friday PST
12:44
we can add it
12:44
Just want to cut a release this week.
12:45
Do you have a sense for the fix? bandwidth to take it up?

Danny Chan  12:46 PM
I try to fix it but can not figure out a good way, if the DELETE block records 
the orderingVal, the format breaks the compatibility.

vc  1:05 PM
We can version the format. thats doable. Should we precombine before even 
logging the deeltes?

Danny Chan  1:11 PM
Yes, we should

vc  1:26 PM
I think, thats how its working today. Deletes don't have an ordering val per 
se, right
1:28
Delete block at t1 :
  delete key k
Data block at t2 :
  ins key k with ordering val 2
We can just fix it so that the insert shows up, since t2 > t1.
For what kind of functionality you need, we need to do soft deletes i.e updates 
with an ordering value instead of hard deletes
1:28
makes sense?

Danny Chan  1:32 PM
we can but that’s not the perfect solution, especially if the dataset comes 
from a CDC source, for example the MySQL binlog. There is no extra flag in 
schema for soft delete though.
1:37
In my opinion, it is not about soft DELETE or hard DELETE, even if we do a soft 
DELETE, the event time (orderingVal) is still important for consumers for 
versoning. (edited) 

vc  1:57 PM
tbh, I don't see us fixing this in two days
1:58
lets do a 0.9.1 after this ?
1:58
shortly after with a bunch of bug fixes and the large pending PRs
1:58
we can even make it 0.10.0

Danny Chan  1:58 PM
Yes, the cut time is very soon. We can move the fix to next version.

vc  1:59 PM
We have some inconsistent semantics in places
1:59
some are commit time (arrival time) based and some are orderingVal (event time) 
based
2:00
In the meantime, see HoodieDeleteBlockVersion you can just define a new version 
for delete block alone for e,g
2:00
and add more information

  was:
The append handle now always write data block first then delete block, and the 
delete block only keeps the hoodie keys, when reading, the scanner just read 
the DELETE block without any info of ordering value, thus, if the we write two 
records:

insert: {id: 0, ts: 2}
delete: {id: 0, ts: 1}

Finally the insert message is deleted !!!, this is a critical bug for streaming 
write, we should fix it as soon as possible

Here is the discussion on slack:

Danny Chan  12:42 PM
https://issues.apache.org/jira/browse/HUDI-2299
12:43
Hi, @vc, our user found a critical bug for MOR log format, if there are 
disorder DELETEs in the streaming messages, the event time of the DELETEs are 
totally ignored.
12:44
I guess this should be a blocker of 0.9 because it affect the correctness of 
the data set.

vc  12:44 PM
if we can fix it by end of day friday PST
12:44
we can add it
12:44
Just want to cut a release this week.
12:45
Do you have a sense for the fix? bandwidth to take it up?

Danny Chan  12:46 PM
I try to fix it but can not figure out a good way, if the DELETE block records 
the orderingVal, the format breaks the compatibility.

vc  1:05 PM
We can version the format. thats doable. Should we precombine before even 
logging the deeltes?

Danny Chan  1:11 PM
Yes, we should

vc  1:26 PM
I think, thats how its working today. Deletes don't have an ordering val per 
se, right
1:28
Delete block at t1 :
  delete key k
Data block at t2 :
  ins key k with ordering val 2
We can just fix it so that the insert shows up, since t2 > t1.
For what kind of functionality you need, we need to do soft deletes i.e updates 
with an ordering value instead of hard deletes
1:28
makes sense?

Danny Chan  1:32 PM
we can but that’s not the perfect solution, especially if the dataset comes 
from a CDC source, for example the MySQL binlog. There is no extra flag in 
schema for soft delete though.
1:37
In my opinion, it is not about soft DELETE or hard DELETE, even if we do a soft 
DELETE, the event time (orderingVal) is still important for consumers for 
versoning. (edited) 

vc  1:57 PM
tbh, I don't see us fixing this in two days

[jira] [Updated] (HUDI-2299) The log format DELETE block lose the info orderingVal

2021-08-12 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-2299:
-
Description: 
The append handle now always writes the data block first and then the delete 
block, and the delete block only keeps the hoodie keys. When reading, the 
scanner just reads the DELETE block without any info about the ordering value. 
Thus, if we write two records:

insert: {id: 0, ts: 2}
delete: {id: 0, ts: 1}

the insert message is finally deleted! This is a critical bug for streaming 
write; we should fix it as soon as possible.

Here is the discussion on slack:

Danny Chan  12:42 PM
https://issues.apache.org/jira/browse/HUDI-2299
12:43
Hi, @vc, our user found a critical bug for MOR log format, if there are 
disorder DELETEs in the streaming messages, the event time of the DELETEs are 
totally ignored.
12:44
I guess this should be a blocker of 0.9 because it affect the correctness of 
the data set.

vc  12:44 PM
if we can fix it by end of day friday PST
12:44
we can add it
12:44
Just want to cut a release this week.
12:45
Do you have a sense for the fix? bandwidth to take it up?

Danny Chan  12:46 PM
I try to fix it but can not figure out a good way, if the DELETE block records 
the orderingVal, the format breaks the compatibility.

vc  1:05 PM
We can version the format. thats doable. Should we precombine before even 
logging the deeltes?

Danny Chan  1:11 PM
Yes, we should

vc  1:26 PM
I think, thats how its working today. Deletes don't have an ordering val per 
se, right
1:28
Delete block at t1 :
  delete key k
Data block at t2 :
  ins key k with ordering val 2
We can just fix it so that the insert shows up, since t2 > t1.
For what kind of functionality you need, we need to do soft deletes i.e updates 
with an ordering value instead of hard deletes
1:28
makes sense?

Danny Chan  1:32 PM
we can but that’s not the perfect solution, especially if the dataset comes 
from a CDC source, for example the MySQL binlog. There is no extra flag in 
schema for soft delete though.
1:37
In my opinion, it is not about soft DELETE or hard DELETE, even if we do a soft 
DELETE, the event time (orderingVal) is still important for consumers for 
versoning. (edited) 

vc  1:57 PM
tbh, I don't see us fixing this in two days
1:58
lets do a 0.9.1 after this ?
1:58
shortly after with a bunch of bug fixes and the large pending PRs
1:58
we can even make it 0.10.0

Danny Chan  1:58 PM
Yes, the cut time is very soon. We can move the fix to next version.

vc  1:59 PM
We have some inconsistent semantics in places
1:59
some are commit time (arrival time) based and some are orderingVal (event time) 
based
2:00
In the meantime, see HoodieDeleteBlockVersion you can just define a new version 
for delete block alone for e,g
2:00
and add more information

  was:
The append handle now always write data block first then delete block, and the 
delete block only keeps the hoodie keys, when reading, the scanner just read 
the DELETE block without any info of ordering value, thus, if the we write two 
records:

insert: {id: 0, ts: 2}
delete: {id: 0, ts: 1}

Finally the insert message is deleted !!!, this is a critical bug for streaming 
write, we should fix it as soon as possible


> The log format DELETE block lose the info orderingVal
> -
>
> Key: HUDI-2299
> URL: https://issues.apache.org/jira/browse/HUDI-2299
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Danny Chen
>Priority: Major
> Fix For: 0.10.0
>
>
> The append handle now always writes the data block first and then the delete 
> block, and the delete block only keeps the hoodie keys. When reading, the 
> scanner just reads the DELETE block without any info about the ordering value. 
> Thus, if we write two records:
> insert: {id: 0, ts: 2}
> delete: {id: 0, ts: 1}
> the insert message is finally deleted! This is a critical bug for 
> streaming write; we should fix it as soon as possible.
> Here is the discussion on slack:
> Danny Chan  12:42 PM
> https://issues.apache.org/jira/browse/HUDI-2299
> 12:43
> Hi, @vc, our user found a critical bug for MOR log format, if there are 
> disorder DELETEs in the streaming messages, the event time of the DELETEs are 
> totally ignored.
> 12:44
> I guess this should be a blocker of 0.9 because it affect the correctness of 
> the data set.
> vc  12:44 PM
> if we can fix it by end of day friday PST
> 12:44
> we can add it
> 12:44
> Just want to cut a release this week.
> 12:45
> Do you have a sense for the fix? bandwidth to take it up?
> Danny Chan  12:46 PM
> I try to fix it but can not figure out a good way, if the DELETE block 
> records the orderingVal, the format breaks the compatibility.
> vc  1:05 PM
> We can version the format. thats doable. Should we precombine before even 
> logging the deel

[jira] [Resolved] (HUDI-2250) [SQL] Bulk insert support for tables w/ primary key

2021-08-12 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-2250.
---
Fix Version/s: (was: 0.10.0)
   0.9.0
   Resolution: Fixed

> [SQL] Bulk insert support for tables w/ primary key
> ---
>
> Key: HUDI-2250
> URL: https://issues.apache.org/jira/browse/HUDI-2250
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: pengzhiwei
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.9.0
>
>
> we want to support bulk insert for any table. Right now, we have a constraint 
> that only tables w/o any primary key can be bulk_inserted. 
>  
>          > 
>          > set hoodie.sql.bulk.insert.enable = true;
> hoodie.sql.bulk.insert.enable true
> Time taken: 2.019 seconds, Fetched 1 row(s)
> spark-sql> set hoodie.datasource.write.row.writer.enable = true;
> hoodie.datasource.write.row.writer.enable true
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> 
>          > 
>          > create table hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>          >   type = 'cow', 
>          >   primaryKey = 'randomId', 
>          >   preCombineField = 'date_col' 
>          >  ) 
>          > partitioned by (type) as select * from gh_17Gb_date_col;
> 21/07/29 04:26:15 ERROR SparkSQLDriver: Failed in [create table 
> hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>   type = 'cow', 
>   primaryKey = 'randomId', 
>   preCombineField = 'date_col' 
>  ) 
> partitioned by (type) as select * from gh_17Gb_date_col]
> java.lang.IllegalArgumentException: Table with primaryKey can not use bulk 
> insert.
> at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.buildHoodieInsertConfig(InsertIntoHoodieTableCommand.scala:219)
> at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:78)
> at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:86)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2250) [SQL] Bulk insert support for tables w/ primary key

2021-08-12 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-2250:
-

Assignee: pengzhiwei

> [SQL] Bulk insert support for tables w/ primary key
> ---
>
> Key: HUDI-2250
> URL: https://issues.apache.org/jira/browse/HUDI-2250
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: pengzhiwei
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.10.0
>
>
> we want to support bulk insert for any table. Right now, we have a constraint 
> that only tables w/o any primary key can be bulk_inserted. 
>  
>          > 
>          > set hoodie.sql.bulk.insert.enable = true;
> hoodie.sql.bulk.insert.enable true
> Time taken: 2.019 seconds, Fetched 1 row(s)
> spark-sql> set hoodie.datasource.write.row.writer.enable = true;
> hoodie.datasource.write.row.writer.enable true
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> 
>          > 
>          > create table hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>          >   type = 'cow', 
>          >   primaryKey = 'randomId', 
>          >   preCombineField = 'date_col' 
>          >  ) 
>          > partitioned by (type) as select * from gh_17Gb_date_col;
> 21/07/29 04:26:15 ERROR SparkSQLDriver: Failed in [create table 
> hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>   type = 'cow', 
>   primaryKey = 'randomId', 
>   preCombineField = 'date_col' 
>  ) 
> partitioned by (type) as select * from gh_17Gb_date_col]
> java.lang.IllegalArgumentException: Table with primaryKey can not use bulk 
> insert.
> at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.buildHoodieInsertConfig(InsertIntoHoodieTableCommand.scala:219)
> at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:78)
> at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:86)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2250) [SQL] Bulk insert support for tables w/ primary key

2021-08-12 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2250:
--
Status: In Progress  (was: Open)

> [SQL] Bulk insert support for tables w/ primary key
> ---
>
> Key: HUDI-2250
> URL: https://issues.apache.org/jira/browse/HUDI-2250
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: pengzhiwei
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.10.0
>
>
> we want to support bulk insert for any table. Right now, we have a constraint 
> that only tables w/o any primary key can be bulk_inserted. 
>  
>          > 
>          > set hoodie.sql.bulk.insert.enable = true;
> hoodie.sql.bulk.insert.enable true
> Time taken: 2.019 seconds, Fetched 1 row(s)
> spark-sql> set hoodie.datasource.write.row.writer.enable = true;
> hoodie.datasource.write.row.writer.enable true
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> 
>          > 
>          > create table hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>          >   type = 'cow', 
>          >   primaryKey = 'randomId', 
>          >   preCombineField = 'date_col' 
>          >  ) 
>          > partitioned by (type) as select * from gh_17Gb_date_col;
> 21/07/29 04:26:15 ERROR SparkSQLDriver: Failed in [create table 
> hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>   type = 'cow', 
>   primaryKey = 'randomId', 
>   preCombineField = 'date_col' 
>  ) 
> partitioned by (type) as select * from gh_17Gb_date_col]
> java.lang.IllegalArgumentException: Table with primaryKey can not use bulk 
> insert.
> at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.buildHoodieInsertConfig(InsertIntoHoodieTableCommand.scala:219)
> at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:78)
> at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:86)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1292) [Umbrella] RFC-15 : File Listing and Query Planning Optimizations

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398383#comment-17398383
 ] 

ASF GitHub Bot commented on HUDI-1292:
--

danny0405 commented on pull request #3427:
URL: https://github.com/apache/hudi/pull/3427#issuecomment-898101210


   +1 to @leesf, these two config options confuse us a lot, let alone the 
user.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Umbrella] RFC-15 : File Listing and Query Planning Optimizations 
> --
>
> Key: HUDI-1292
> URL: https://issues.apache.org/jira/browse/HUDI-1292
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration, Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Major
>  Labels: hudi-umbrellas, pull-request-available
> Fix For: 0.10.0
>
>
> This is the umbrella ticket that tracks the overall implementation of RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 commented on pull request #3427: [HUDI-1292] Created a config to enable/disable syncing of metadata table.

2021-08-12 Thread GitBox


danny0405 commented on pull request #3427:
URL: https://github.com/apache/hudi/pull/3427#issuecomment-898101210


   +1 to @leesf, these two config options confuse us a lot, let alone the 
user.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2119) Syncing of rollbacks to metadata table does not work in all cases

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398372#comment-17398372
 ] 

ASF GitHub Bot commented on HUDI-2119:
--

hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1698)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Syncing of rollbacks to metadata table does not work in all cases
> -
>
> Key: HUDI-2119
> URL: https://issues.apache.org/jira/browse/HUDI-2119
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> This is an issue with inline automatic rollbacks.
> The metadata table assumes that a rollback is to be applied if the instant 
> being rolled back has a timestamp less than the last deltacommit time on the 
> metadata timeline. We do not explicitly check whether the instant being 
> rolled back was actually written to the metadata table.
> A rollback adds a record to the metadata table which "deletes" files from a 
> failed/earlier commit. If the files being deleted were never actually 
> committed to metadata table earlier, the deletes cannot be consolidated 
> during metadata table reads. This leads to a HoodieMetadataException as we 
> cannot differentiate this from a bug where we might have missed committing a 
> commit to metadata table.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
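
A minimal sketch of the guard described above, using hypothetical names rather than the 
actual metadata writer code: a rollback is applied to the metadata table only when the 
instant being rolled back was itself synced there, instead of relying on the timestamp 
comparison alone.

{code:scala}
// Hypothetical names; not the actual Hudi metadata writer code.
object RollbackSyncSketch {
  // Timestamp-based assumption: apply if older than the last synced deltacommit.
  // Lexicographic comparison works because Hudi instant times are fixed-width digits.
  def applyByTimestamp(lastSyncedInstant: String, rolledBackInstant: String): Boolean =
    rolledBackInstant < lastSyncedInstant

  // Stricter check: only apply if the rolled-back instant was actually synced.
  def applyIfSynced(syncedInstants: Set[String], rolledBackInstant: String): Boolean =
    syncedInstants.contains(rolledBackInstant)

  def main(args: Array[String]): Unit = {
    val synced = Set("20210701000000", "20210702000000")
    val neverSynced = "20210701120000" // failed commit that never reached the metadata table

    println(applyByTimestamp(synced.max, neverSynced)) // true  -> would log spurious deletes
    println(applyIfSynced(synced, neverSynced))        // false -> skip and avoid the exception
  }
}
{code}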


[GitHub] [hudi] hudi-bot edited a comment on pull request #3210: [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant.

2021-08-12 Thread GitBox


hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1698)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1507) Hive sync having issues w/ Clustering

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1507:

Fix Version/s: 0.8.0

> Hive sync having issues w/ Clustering
> -
>
> Key: HUDI-1507
> URL: https://issues.apache.org/jira/browse/HUDI-1507
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Storage Management
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available, release-blocker
> Fix For: 0.8.0
>
>
> I was trying out clustering w/ test suite job and ran into hive sync issues.
>  
> 21/01/05 16:45:05 WARN DagNode: Executing ClusteringNode node 
> 5522853c-653b-4d92-acf4-d299c263a77f
> 21/01/05 16:45:05 WARN AbstractHoodieWriteClient: Scheduling clustering at 
> instant time :20210105164505 clustering strategy 
> org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy,
>  clustering sort cols : _row_key, target partitions for clustering :: 0, 
> inline cluster max commit : 1
> 21/01/05 16:45:05 WARN HoodieTestSuiteWriter: Clustering instant :: 
> 20210105164505
> 21/01/05 16:45:22 WARN DagScheduler: Executing node "second_hive_sync" :: 
> \{"queue_name":"adhoc","engine":"mr","name":"80325009-bb92-4df5-8c34-71bd75d001b8","config":"second_hive_sync"}
> 21/01/05 16:45:22 ERROR HiveSyncTool: Got runtime exception when hive syncing
> org.apache.hudi.exception.HoodieIOException: unknown action in timeline 
> replacecommit
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getAffectedPartitions$1(TimelineUtils.java:99)
>  at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getAffectedPartitions(TimelineUtils.java:102)
>  at 
> org.apache.hudi.common.table.timeline.TimelineUtils.getPartitionsWritten(TimelineUtils.java:50)
>  at 
> org.apache.hudi.sync.common.AbstractSyncHoodieClient.getPartitionsWrittenToSince(AbstractSyncHoodieClient.java:136)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:145)
>  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:589)
>  at 
> org.apache.hudi.integ.testsuite.helpers.HiveServiceProvider.syncToLocalHiveIfNeeded(HiveServiceProvider.java:53)
>  at 
> org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode.execute(HiveSyncNode.java:41)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.executeNode(DagScheduler.java:139)
>  at 
> org.apache.hudi.integ.testsuite.dag.scheduler.DagScheduler.lambda$execute$0(DagScheduler.java:105)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
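
For context, the failure comes from partition extraction not recognising the replacecommit 
action that clustering writes. A small sketch of the idea, with simplified types that are 
assumptions for this illustration rather than the actual TimelineUtils code:

{code:scala}
// Simplified types for illustration; not the actual TimelineUtils implementation.
object AffectedPartitionsSketch {
  case class TimelineInstant(action: String, partitions: Seq[String])

  def affectedPartitions(timeline: Seq[TimelineInstant]): Seq[String] =
    timeline.flatMap { instant =>
      instant.action match {
        case "commit" | "deltacommit" => instant.partitions
        case "replacecommit"          => instant.partitions // written by clustering / insert_overwrite
        case other =>
          throw new IllegalArgumentException(s"unknown action in timeline $other")
      }
    }.distinct

  def main(args: Array[String]): Unit = {
    val timeline = Seq(
      TimelineInstant("commit", Seq("2021/01/05")),
      TimelineInstant("replacecommit", Seq("2021/01/05", "2021/01/06")))
    println(affectedPartitions(timeline)) // List(2021/01/05, 2021/01/06)
  }
}
{code}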


[jira] [Updated] (HUDI-1901) Not an Avro data file during archive

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1901:

Fix Version/s: 0.9.0

> Not an Avro data file during archive
> 
>
> Key: HUDI-1901
> URL: https://issues.apache.org/jira/browse/HUDI-1901
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Gary Li
>Assignee: Gary Li
>Priority: Blocker
> Fix For: 0.9.0
>
>
> https://github.com/apache/hudi/issues/2944



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2119) Syncing of rollbacks to metadata table does not work in all cases

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398361#comment-17398361
 ] 

ASF GitHub Bot commented on HUDI-2119:
--

hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * 23f261c5ebd97be2b7bd5cc9bb9c536c8866da57 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1694)
 
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1698)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Syncing of rollbacks to metadata table does not work in all cases
> -
>
> Key: HUDI-2119
> URL: https://issues.apache.org/jira/browse/HUDI-2119
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> This is an issue with inline automatic rollbacks.
> The metadata table assumes that a rollback is to be applied if the instant 
> being rolled back has a timestamp less than the last deltacommit time on the 
> metadata timeline. We do not explicitly check whether the instant being 
> rolled back was actually written to the metadata table.
> A rollback adds a record to the metadata table which "deletes" files from a 
> failed/earlier commit. If the files being deleted were never actually 
> committed to metadata table earlier, the deletes cannot be consolidated 
> during metadata table reads. This leads to a HoodieMetadataException as we 
> cannot differentiate this from a bug where we might have missed committing a 
> commit to metadata table.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3210: [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant.

2021-08-12 Thread GitBox


hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * 23f261c5ebd97be2b7bd5cc9bb9c536c8866da57 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1694)
 
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1698)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1764) Add support for Hudi CLI tools to schedule and run clustering

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1764:

Fix Version/s: 0.9.0

> Add support for Hudi CLI tools to schedule and run clustering
> -
>
> Key: HUDI-1764
> URL: https://issues.apache.org/jira/browse/HUDI-1764
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: CLI
>Reporter: Jintao
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently, Hudi CLI doesn't have the capability to schedule or run clustering.
> We would like to add it to Hudi CLI tools.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2004) Move KafkaOffsetGen.CheckpointUtils test cases to independent class and improve coverage

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2004:

Fix Version/s: 0.9.0

> Move KafkaOffsetGen.CheckpointUtils test cases to independent class and 
> improve coverage
> 
>
> Key: HUDI-2004
> URL: https://issues.apache.org/jira/browse/HUDI-2004
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Vinay
>Assignee: Vinay
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently the KafkaOffsetGen.CheckpointUtils test cases are present in 
> TestKafkaSource, which starts up HDFS, Hive, and ZK services locally. This is 
> not required for the CheckpointUtils test cases, hence they should be moved 
> to an independent test class of their own.
>  
> Also, CheckpointUtils.strToOffsets and CheckpointUtils.offsetsToStr are not 
> unit tested currently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
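
A standalone round-trip test of the kind suggested above could look roughly like this 
sketch. The checkpoint string format and the helper behaviour shown here are assumptions 
for the illustration, not the exact CheckpointUtils API:

{code:scala}
// The checkpoint format and helper behaviour are assumptions for this sketch.
object CheckpointRoundTripSketch {
  def offsetsToStr(topic: String, offsets: Map[Int, Long]): String =
    topic + "," + offsets.toSeq.sortBy(_._1).map { case (p, o) => s"$p:$o" }.mkString(",")

  def strToOffsets(checkpoint: String): (String, Map[Int, Long]) = {
    val parts = checkpoint.split(",")
    val offsets = parts.tail.map { kv =>
      val Array(p, o) = kv.split(":")
      p.toInt -> o.toLong
    }.toMap
    (parts.head, offsets)
  }

  def main(args: Array[String]): Unit = {
    val original = ("impressions", Map(0 -> 250L, 1 -> 249L))
    val roundTripped = strToOffsets(offsetsToStr(original._1, original._2))
    assert(roundTripped == original, s"round trip broke: $roundTripped")
    println("checkpoint round-trip OK")
  }
}
{code}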


[jira] [Commented] (HUDI-2119) Syncing of rollbacks to metadata table does not work in all cases

2021-08-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398360#comment-17398360
 ] 

ASF GitHub Bot commented on HUDI-2119:
--

hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * 23f261c5ebd97be2b7bd5cc9bb9c536c8866da57 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1694)
 
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Syncing of rollbacks to metadata table does not work in all cases
> -
>
> Key: HUDI-2119
> URL: https://issues.apache.org/jira/browse/HUDI-2119
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> This is an issue with inline automatic rollbacks.
> The metadata table assumes that a rollback is to be applied if the instant 
> being rolled back has a timestamp less than the last deltacommit time on the 
> metadata timeline. We do not explicitly check whether the instant being 
> rolled back was actually written to the metadata table.
> A rollback adds a record to the metadata table which "deletes" files from a 
> failed/earlier commit. If the files being deleted were never actually 
> committed to metadata table earlier, the deletes cannot be consolidated 
> during metadata table reads. This leads to a HoodieMetadataException as we 
> cannot differentiate this from a bug where we might have missed committing a 
> commit to metadata table.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3210: [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant.

2021-08-12 Thread GitBox


hudi-bot edited a comment on pull request #3210:
URL: https://github.com/apache/hudi/pull/3210#issuecomment-872541566


   
   ## CI report:
   
   * 23f261c5ebd97be2b7bd5cc9bb9c536c8866da57 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1694)
 
   * a9dcb727c23272b2c8b74647b467f413b9e83f5d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2159) Supporting Clustering and Metadata Table together

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2159:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Supporting Clustering and Metadata Table together
> -
>
> Key: HUDI-2159
> URL: https://issues.apache.org/jira/browse/HUDI-2159
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
> Fix For: 0.10.0
>
>
> I am testing clustering support for metadata enabled table and found a few 
> issues.
> *Setup*
> Pipeline 1: Ingestion pipeline with Metadata Table enabled. Runs every 30 
> mins. 
> Pipeline 2: Clustering pipeline with long running jobs (3-4 hours)
> Pipeline 3: Another clustering pipeline with long running jobs (3-4 hours)
>  
> *Issue #1: Parallel commits on Metadata Table*
> Assume the Clustering pipeline is completing T5.replacecommit and the 
> ingestion pipeline is completing T10.commit, while the Metadata Table was 
> last synced at an instant before T5. Now both pipelines will call 
> syncMetadataTable(), which will do the following:
>  # Find all un-synced instants from the dataset (T5, T6 ... T10)
>  # Read each instant and perform a deltacommit on the Metadata Table with the 
> same timestamp as the instant.
> There is a chance that two processes perform the deltacommit at T5 on the 
> metadata table and one will fail (instant file already exists). This raises 
> an exception and is detected as a pipeline failure, leading to false-positive 
> alerts.
>  
> *Issue #2: No archiving/rollback support for failed clustering operations*
> If a clustering operation fails, it leaves a left-over 
> T5.replacecommit.inflight. There is no automated way to rollback or archive 
> these. Since clustering is a long running operation in general and may be run 
> through multiple pipelines at the same time, automated rollback of left-over 
> inflights doesn't work as we cannot be sure that the process is dead.
> Metadata Table sync only works in completion order. So if 
> T5.replacecommit.inflight is left over, the Metadata Table will not sync 
> beyond T5, causing a large number of LogBlocks to pile up which will have to 
> be merged in memory, leading to deteriorating performance.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
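
A toy sketch of the race in Issue #1, with hypothetical names (not the actual Hudi sync 
code): both writers see T5 as un-synced, one loses the race to create the instant, and 
treating that loss as benign would avoid the false-positive pipeline failure described above.

{code:scala}
// Hypothetical names; not the actual Hudi metadata sync code.
import java.util.concurrent.ConcurrentHashMap

object MetadataSyncRaceSketch {
  private val metadataTimeline = ConcurrentHashMap.newKeySet[String]()

  /** Returns true if this writer performed the sync, false if another writer won the race. */
  def syncInstant(instant: String): Boolean = {
    val won = metadataTimeline.add(instant) // stands in for atomically creating the instant file
    if (!won) println(s"instant $instant already synced by another pipeline, skipping")
    won
  }

  def main(args: Array[String]): Unit = {
    // Both the ingestion and the clustering pipeline see T5..T10 as un-synced.
    Seq("T5", "T6", "T10").foreach(syncInstant) // first writer succeeds
    println(syncInstant("T5"))                  // false: benign, not a pipeline failure
  }
}
{code}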


[jira] [Updated] (HUDI-1290) Implement Debezium avro source for Delta Streamer

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1290:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Implement Debezium avro source for Delta Streamer
> -
>
> Key: HUDI-1290
> URL: https://issues.apache.org/jira/browse/HUDI-1290
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Affects Versions: 0.9.0
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.10.0
>
>
> We need to implement a transformer and payloads for seamlessly pulling change 
> logs emitted by Debezium into Kafka.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1309:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Listing Metadata unreadable in S3 as the log block is deemed corrupted
> --
>
> Key: HUDI-1309
> URL: https://issues.apache.org/jira/browse/HUDI-1309
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.10.0
>
>
> When running metadata list-partitions CLI command, I am seeing the below 
> messages and the partition list is empty. Was expecting 10K partitions.
>  
> {code:java}
>  36589 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning 
> log file 
> HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0}
>  36590 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block 
> in file 
> HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} with block size(3723305) running past EOF
>  36684 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Log 
> HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} has a corrupted block at 14
>  44515 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block 
> in 
> HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} starts at 3723319
>  44566 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
> corrupt block in 
> s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
>  44567 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
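
The reader treats a block whose declared size runs past the end of the file as corrupt, so 
a stale file length of 0 (as the S3 listing above reports) makes every block look corrupt. 
A toy sketch of that check, as an illustration rather than the actual HoodieLogFileReader 
logic:

{code:scala}
// Toy check, not the actual HoodieLogFileReader corrupt-block detection.
object CorruptBlockCheckSketch {
  /** A block is deemed corrupt if its declared size runs past the end of the file. */
  def isCorrupt(blockStartPos: Long, declaredBlockSize: Long, fileLen: Long): Boolean =
    blockStartPos + declaredBlockSize > fileLen

  def main(args: Array[String]): Unit = {
    // With an accurate file length the block is readable...
    println(isCorrupt(blockStartPos = 14, declaredBlockSize = 3723305, fileLen = 4000000)) // false
    // ...but a stale length of 0 (as the listing above reports) flags it as corrupt.
    println(isCorrupt(blockStartPos = 14, declaredBlockSize = 3723305, fileLen = 0))       // true
  }
}
{code}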


[jira] [Updated] (HUDI-1537) Move validation of file listings to something that happens before each write

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1537:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Move validation of file listings to something that happens before each write
> 
>
> Key: HUDI-1537
> URL: https://issues.apache.org/jira/browse/HUDI-1537
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> The current way of checking has issues dealing with log files and inflight 
> files. The code has comments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1500) Support incrementally reading clustering commit via Spark Datasource/DeltaStreamer

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1500:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Support incrementally reading clustering  commit via Spark 
> Datasource/DeltaStreamer
> ---
>
> Key: HUDI-1500
> URL: https://issues.apache.org/jira/browse/HUDI-1500
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer, Spark Integration
>Reporter: liwei
>Assignee: satish
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Currently, DeltaSync.readFromSource() cannot read the last instant when it is 
> a replace commit, such as one produced by clustering.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1492) Handle DeltaWriteStat correctly for storage schemes that support appends

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1492:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Handle DeltaWriteStat correctly for storage schemes that support appends
> 
>
> Key: HUDI-1492
> URL: https://issues.apache.org/jira/browse/HUDI-1492
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
> Fix For: 0.10.0
>
>
> The current implementation simply uses the
> {code:java}
> String pathWithPartition = hoodieWriteStat.getPath(); {code}
> to write the metadata table. This is problematic if the delta write was 
> merely an append, and can technically add duplicate files into the metadata 
> table 
> (not sure if this is a problem per se, but filing a Jira to track and either 
> close/fix).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1542) Fix Flaky test : TestHoodieMetadata#testSync

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1542:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Fix Flaky test : TestHoodieMetadata#testSync
> 
>
> Key: HUDI-1542
> URL: https://issues.apache.org/jira/browse/HUDI-1542
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Only fails intermittently on CI.
> {code}
> [INFO] Running org.apache.hudi.metadata.TestHoodieBackedMetadata
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/travis/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/travis/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> [WARN ] 2021-01-20 09:25:31,716 org.apache.spark.util.Utils  - Your hostname, 
> localhost resolves to a loopback address: 127.0.0.1; using 10.30.0.81 instead 
> (on interface eth0)
> [WARN ] 2021-01-20 09:25:31,725 org.apache.spark.util.Utils  - Set 
> SPARK_LOCAL_IP if you need to bind to another address
> [WARN ] 2021-01-20 09:25:32,412 org.apache.hadoop.util.NativeCodeLoader  - 
> Unable to load native-hadoop library for your platform... using builtin-java 
> classes where applicable
> [WARN ] 2021-01-20 09:25:36,645 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:25:36,700 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:26:30,250 
> org.apache.hudi.client.AbstractHoodieWriteClient  - Cannot find instant 
> 20210120092628 in the timeline, for rollback
> [WARN ] 2021-01-20 09:26:45,980 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:26:46,568 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:26:46,580 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:27:27,853 
> org.apache.hudi.client.AbstractHoodieWriteClient  - Cannot find instant 
> 20210120092726 in the timeline, for rollback
> [WARN ] 2021-01-20 09:27:43,037 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:27:46,017 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3284615140376500245/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:28:05,357 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__REQUESTED]
> [WARN ] 2021-01-20 09:28:05,887 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:06,312 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:18,402 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:28:22,013 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit4284626513859445824/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:28:40,354 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__REQUESTED]
> [WARN ] 2021-01-20 09:28:40,780 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:41,162 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__INFLIGHT]
> =[ 605 seconds still running ]=
> [ERROR] 2021-01-20 09:28:50,683 
> org.apache.hudi.timeline.service.FileSystemViewHandler  - Got runtime 
> exception servicing request 
> 

[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1706:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Test flakiness w/ multiwriter test
> --
>
> Key: HUDI-1706
> URL: https://issues.apache.org/jira/browse/HUDI-1706
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.10.0
>
>
> [https://api.travis-ci.com/v3/job/492130170/log.txt]
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1839) FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1839:

Fix Version/s: (was: 0.9.0)
   0.10.0

> FSUtils getAllPartitions broken by NotSerializableException: 
> org.apache.hadoop.fs.Path
> --
>
> Key: HUDI-1839
> URL: https://issues.apache.org/jira/browse/HUDI-1839
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: satish
>Priority: Blocker
> Fix For: 0.10.0
>
>
> FSUtils.getAllPartitionPaths is expected to work whether the metadata table is 
> enabled or not. It can also be called inside a Spark context. But it looks like 
> an attempt to improve parallelism is causing NotSerializableExceptions. There 
> are multiple callers using it within a Spark context (clustering/cleaner).
> See the stack trace below:
> 21/04/20 17:28:44 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.hudi.exception.HoodieException: Error fetching partition paths 
> from metadata table
>  at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:321)
>  at 
> org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy.generateClusteringPlan(PartitionAwareClusteringPlanStrategy.java:67)
>  at 
> org.apache.hudi.table.action.cluster.SparkClusteringPlanActionExecutor.createClusteringPlan(SparkClusteringPlanActionExecutor.java:71)
>  at 
> org.apache.hudi.table.action.cluster.BaseClusteringPlanActionExecutor.execute(BaseClusteringPlanActionExecutor.java:56)
>  at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleClustering(HoodieSparkCopyOnWriteTable.java:160)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClusteringAtInstant(AbstractHoodieWriteClient.java:873)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClustering(AbstractHoodieWriteClient.java:861)
>  at 
> com.uber.data.efficiency.hudi.HudiRewriter.rewriteDataUsingHudi(HudiRewriter.java:111)
>  at com.uber.data.efficiency.hudi.HudiRewriter.main(HudiRewriter.java:50)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
>  Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Failed to serialize task 53, not attempting to retry it. Exception 
> during serialization: java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path
>  Serialization stack:
>  - object not serializable (class: org.apache.hadoop.fs.Path, value: 
> hdfs://...)
>  - element of array (index: 0)
>  - array (class [Ljava.lang.Object;, size 1)
>  - field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, 
> type: class [Ljava.lang.Object;)
>  - object (class scala.collection.mutable.WrappedArray$ofRef, 
> WrappedArray(hdfs://...))
>  - writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
>  - object (class org.apache.spark.rdd.ParallelCollectionPartition, 
> org.apache.spark.rdd.ParallelCollectionPartition@735)
>  - field (class: org.apache.spark.scheduler.ResultTask, name: partition, 
> type: interface org.apache.spark.Partition)
>  - object (class org.apache.spark.scheduler.ResultTask, ResultTask(1, 0))
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1904)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1892)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1891)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1891)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2125)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2074)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:20
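A minimal sketch of the usual remedy for this class of failure, assuming the listing parallelizes a collection of org.apache.hadoop.fs.Path objects: convert them to Strings on the driver and rebuild Path inside the task, so nothing non-serializable is captured by the RDD. The class and method names are illustrative and this is not the actual Hudi fix.

{code:java}
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative only: ship Strings, not Path objects, across the Spark boundary.
class SerializablePartitionListing {

  static List<String> partitionNames(JavaSparkContext jsc, List<Path> partitionDirs) {
    // Convert Path -> String on the driver before parallelizing.
    List<String> asStrings = partitionDirs.stream()
        .map(Path::toString)
        .collect(Collectors.toList());

    return jsc.parallelize(asStrings, Math.max(1, asStrings.size()))
        // Path is re-created inside the executor, so no Path instance is serialized.
        .map(p -> new Path(p).getName())
        .collect();
  }
}
{code}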

[jira] [Updated] (HUDI-1912) Presto defaults to GenericHiveRecordCursor for all Hudi tables

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1912:

Status: In Progress  (was: Open)

> Presto defaults to GenericHiveRecordCursor for all Hudi tables
> --
>
> Key: HUDI-1912
> URL: https://issues.apache.org/jira/browse/HUDI-1912
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Presto Integration
>Affects Versions: 0.7.0
>Reporter: satish
>Priority: Blocker
> Fix For: 0.9.0
>
>
> See code here 
> https://github.com/prestodb/presto/blob/2ad67dcf000be86ebc5ff7732bbb9994c8e324a8/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java#L168
> Starting with Hudi 0.7, HoodieInputFormat comes with the 
> UseRecordReaderFromInputFormat annotation. As a result, we skip all 
> optimizations in the parquet PageSource and use the basic GenericHiveRecordCursor, 
> which has several limitations:
> 1) No support for timestamp
> 2) No support for synthesized columns
> 3) No support for vectorized reading?
> Example errors we saw:
> Error#1
> {code}
> java.lang.IllegalStateException: column type must be regular
>   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:507)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.<init>(GenericHiveRecordCursor.java:167)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:79)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:449)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:177)
>   at 
> com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:63)
>   at 
> com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:80)
>   at 
> com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:231)
>   at com.facebook.presto.operator.Driver.processInternal(Driver.java:418)
>   at 
> com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301)
>   at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722)
>   at com.facebook.presto.operator.Driver.processFor(Driver.java:294)
>   at 
> com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
>   at 
> com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
>   at 
> com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:545)
>   at 
> com.facebook.presto.$gen.Presto_0_247_17f857e20210506_210241_1.run(Unknown
>  Source)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834) 
> {code}
> Error#2
> {code}
> java.lang.ClassCastException: class org.apache.hadoop.io.LongWritable cannot 
> be cast to class org.apache.hadoop.hive.serde2.io.TimestampWritable 
> (org.apache.hadoop.io.LongWritable and 
> org.apache.hadoop.hive.serde2.io.TimestampWritable are in unnamed module of 
> loader com.facebook.presto.server.PluginClassLoader @5c4e86e7)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:39)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:25)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.parseLongColumn(GenericHiveRecordCursor.java:286)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.parseColumn(GenericHiveRecordCursor.java:550)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.isNull(GenericHiveRecordCursor.java:508)
>   at 
> com.facebook.presto.hive.HiveRecordCursor.isNull(HiveRecordCursor.java:233)
>   at 
> com.facebook.presto.spi.RecordPageSource.getNextPage(RecordPageSource.java:112)
>   at 
> com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:251)
>   at com.facebook.presto.operator.Driver.processInternal(Driver.java:418)
>   at 
> com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301)
>   at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722)
>   at com.facebook.presto.operator.Driver.processFor(Driver.java:294)
>   at 
> com.facebook.presto.execution.SqlTaskExecu

[jira] [Updated] (HUDI-1912) Presto defaults to GenericHiveRecordCursor for all Hudi tables

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1912:

Fix Version/s: (was: 0.9.0)
   0.7.0

> Presto defaults to GenericHiveRecordCursor for all Hudi tables
> --
>
> Key: HUDI-1912
> URL: https://issues.apache.org/jira/browse/HUDI-1912
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Presto Integration
>Affects Versions: 0.7.0
>Reporter: satish
>Priority: Blocker
> Fix For: 0.7.0
>
>
> See code here 
> https://github.com/prestodb/presto/blob/2ad67dcf000be86ebc5ff7732bbb9994c8e324a8/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java#L168
> Starting with Hudi 0.7, HoodieInputFormat comes with the 
> UseRecordReaderFromInputFormat annotation. As a result, we skip all 
> optimizations in the parquet PageSource and use the basic GenericHiveRecordCursor, 
> which has several limitations:
> 1) No support for timestamp
> 2) No support for synthesized columns
> 3) No support for vectorized reading?
> Example errors we saw:
> Error#1
> {code}
> java.lang.IllegalStateException: column type must be regular
>   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:507)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.<init>(GenericHiveRecordCursor.java:167)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:79)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:449)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:177)
>   at 
> com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:63)
>   at 
> com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:80)
>   at 
> com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:231)
>   at com.facebook.presto.operator.Driver.processInternal(Driver.java:418)
>   at 
> com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301)
>   at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722)
>   at com.facebook.presto.operator.Driver.processFor(Driver.java:294)
>   at 
> com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
>   at 
> com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
>   at 
> com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:545)
>   at 
> com.facebook.presto.$gen.Presto_0_247_17f857e20210506_210241_1.run(Unknown
>  Source)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834) 
> {code}
> Error#2
> {code}
> java.lang.ClassCastException: class org.apache.hadoop.io.LongWritable cannot 
> be cast to class org.apache.hadoop.hive.serde2.io.TimestampWritable 
> (org.apache.hadoop.io.LongWritable and 
> org.apache.hadoop.hive.serde2.io.TimestampWritable are in unnamed module of 
> loader com.facebook.presto.server.PluginClassLoader @5c4e86e7)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:39)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:25)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.parseLongColumn(GenericHiveRecordCursor.java:286)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.parseColumn(GenericHiveRecordCursor.java:550)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.isNull(GenericHiveRecordCursor.java:508)
>   at 
> com.facebook.presto.hive.HiveRecordCursor.isNull(HiveRecordCursor.java:233)
>   at 
> com.facebook.presto.spi.RecordPageSource.getNextPage(RecordPageSource.java:112)
>   at 
> com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:251)
>   at com.facebook.presto.operator.Driver.processInternal(Driver.java:418)
>   at 
> com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301)
>   at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722)
>   at com.facebook.presto.operator.Driver.processFor(Driver.java:294)
>   at 
> com.facebook.prest

[jira] [Updated] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1937:

Fix Version/s: (was: 0.9.0)
   0.10.0

> When clustering fail, generating unfinished replacecommit timeline.
> ---
>
> Key: HUDI-1937
> URL: https://issues.apache.org/jira/browse/HUDI-1937
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Affects Versions: 0.8.0
>Reporter: taylor liao
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.10.0
>
>
> When clustering fails, an unfinished replacecommit is generated.
>  Restarting the job will generate a delta commit; if that commit touches a file 
> group under pending clustering, the task will fail with:
>  "Not allowed to update the clustering file group %s
>  For pending clustering operations, we are not going to support update for 
> now."
>  We need to ensure that the unfinished replacecommit file is deleted, or perform 
> clustering first and then generate the delta commit.
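A hypothetical guard illustrating the remedy described above: before writing a delta commit, fail fast (or trigger clustering/rollback) if an unfinished replacecommit is still on the timeline. The real check would go against the Hudi active timeline; plain strings are used here to keep the sketch self-contained.

{code:java}
import java.util.List;

// Illustrative only: refuse to write delta commits while a replacecommit is pending.
class PendingReplaceCommitGuard {

  static void ensureNoPendingClustering(List<String> pendingReplaceCommitInstants) {
    if (!pendingReplaceCommitInstants.isEmpty()) {
      throw new IllegalStateException("Pending replacecommit(s) " + pendingReplaceCommitInstants
          + " found; finish or roll back clustering before writing delta commits");
    }
  }

  public static void main(String[] args) {
    ensureNoPendingClustering(List.of());                   // fine, nothing pending
    ensureNoPendingClustering(List.of("20210812010101"));   // throws, like a failed clustering run would
  }
}
{code}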



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1970) Performance testing/certification of key SQL DMLs

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1970:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Performance testing/certification of key SQL DMLs
> -
>
> Key: HUDI-1970
> URL: https://issues.apache.org/jira/browse/HUDI-1970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance, Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2063) [SQL] Add Doc For Spark Sql Integrates With Hudi

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-2063.
-
Resolution: Fixed

> [SQL] Add Doc For Spark Sql Integrates With Hudi
> 
>
> Key: HUDI-2063
> URL: https://issues.apache.org/jira/browse/HUDI-2063
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1958) [Umbrella] Follow up items from 1 pass over GH issues

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1958:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Umbrella] Follow up items from 1 pass over GH issues
> -
>
> Key: HUDI-1958
> URL: https://issues.apache.org/jira/browse/HUDI-1958
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2063) [SQL] Add Doc For Spark Sql Integrates With Hudi

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2063:

Status: In Progress  (was: Open)

> [SQL] Add Doc For Spark Sql Integrates With Hudi
> 
>
> Key: HUDI-2063
> URL: https://issues.apache.org/jira/browse/HUDI-2063
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2119) Syncing of rollbacks to metadata table does not work in all cases

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2119:

Status: In Progress  (was: Open)

> Syncing of rollbacks to metadata table does not work in all cases
> -
>
> Key: HUDI-2119
> URL: https://issues.apache.org/jira/browse/HUDI-2119
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> This is an issue with inline automatic rollbacks.
> The metadata table assumes that a rollback is to be applied if the 
> instant-being-rolled-back has a timestamp less than the last deltacommit time 
> on the metadata timeline. We do not explicitly check whether the 
> instant-being-rolled-back was actually written to the metadata table.
> A rollback adds a record to the metadata table which "deletes" files from a 
> failed/earlier commit. If the files being deleted were never actually 
> committed to the metadata table earlier, the deletes cannot be consolidated 
> during metadata table reads. This leads to a HoodieMetadataException, as we 
> cannot differentiate this from a bug where we might have missed committing a 
> commit to the metadata table.
>  
>  
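A hypothetical sketch of the stricter check implied above: only apply a rollback to the metadata table if the instant being rolled back was actually synced to it, not merely because its timestamp is older than the last metadata deltacommit. The helper and variable names are illustrative and not the actual Hudi API.

{code:java}
import java.util.Set;

// Illustrative only: timestamp ordering alone is not enough to decide whether a
// rollback must be applied to the metadata table.
class RollbackSyncCheck {

  static boolean shouldApplyRollback(String instantToRollback,
                                     String lastMetadataDeltaCommit,
                                     Set<String> instantsSyncedToMetadata) {
    boolean olderThanLastSync = instantToRollback.compareTo(lastMetadataDeltaCommit) < 0;
    // Without this membership check, "delete" records can arrive for files that
    // were never added, which is what surfaces as a HoodieMetadataException.
    return olderThanLastSync && instantsSyncedToMetadata.contains(instantToRollback);
  }

  public static void main(String[] args) {
    // The failed instant was never synced to the metadata table, so skip the rollback record.
    System.out.println(shouldApplyRollback("20210120092628", "20210120093000",
        Set.of("20210120092000")));  // false
  }
}
{code}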



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2120) Update docs about schema in flink sql configuration

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2120:

Status: In Progress  (was: Open)

> Update docs about schema in flink sql configuration
> ---
>
> Key: HUDI-2120
> URL: https://issues.apache.org/jira/browse/HUDI-2120
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2250) [SQL] Bulk insert support for tables w/ primary key

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-2250:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [SQL] Bulk insert support for tables w/ primary key
> ---
>
> Key: HUDI-2250
> URL: https://issues.apache.org/jira/browse/HUDI-2250
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.10.0
>
>
> We want to support bulk insert for any table. Right now, there is a constraint 
> that only tables without a primary key can use bulk_insert. 
>  
>          > 
>          > set hoodie.sql.bulk.insert.enable = true;
> hoodie.sql.bulk.insert.enable true
> Time taken: 2.019 seconds, Fetched 1 row(s)
> spark-sql> set hoodie.datasource.write.row.writer.enable = true;
> hoodie.datasource.write.row.writer.enable true
> Time taken: 0.026 seconds, Fetched 1 row(s)
> spark-sql> 
>          > 
>          > create table hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>          >   type = 'cow', 
>          >   primaryKey = 'randomId', 
>          >   preCombineField = 'date_col' 
>          >  ) 
>          > partitioned by (type) as select * from gh_17Gb_date_col;
> 21/07/29 04:26:15 ERROR SparkSQLDriver: Failed in [create table 
> hudi_17Gb_ext1 using hudi location 
> 's3a://siva-test-bucket-june-16/hudi_testing/gh_arch_dump/hudi_5/' options ( 
>   type = 'cow', 
>   primaryKey = 'randomId', 
>   preCombineField = 'date_col' 
>  ) 
> partitioned by (type) as select * from gh_17Gb_date_col]
> java.lang.IllegalArgumentException: Table with primaryKey can not use bulk 
> insert.
> at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.buildHoodieInsertConfig(InsertIntoHoodieTableCommand.scala:219)
> at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:78)
> at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:86)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1239) [UMBRELLA] Config clean up

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1239:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [UMBRELLA] Config clean up
> --
>
> Key: HUDI-1239
> URL: https://issues.apache.org/jira/browse/HUDI-1239
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 0.10.0
>
>
> Tracks all efforts to clean up configs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1237) [UMBRELLA] Checkstyle, formatting, warnings, spotless

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1237:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [UMBRELLA] Checkstyle, formatting, warnings, spotless
> -
>
> Key: HUDI-1237
> URL: https://issues.apache.org/jira/browse/HUDI-1237
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: sivabalan narayanan
>Assignee: leesf
>Priority: Blocker
>  Labels: gsoc, gsoc2021, hudi-umbrellas, mentor
> Fix For: 0.10.0
>
>
> Umbrella ticket to track all tickets related to checkstyle, spotless, 
> warnings etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1236) [UMBRELLA] Long running test suite

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1236:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [UMBRELLA] Long running test suite
> --
>
> Key: HUDI-1236
> URL: https://issues.apache.org/jira/browse/HUDI-1236
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: hudi-umbrellas
> Fix For: 0.10.0
>
>
> Long running test suite that checks for correctness across all deployment 
> modes (batch/streaming) and writers (deltastreamer/spark) and readers (hive, 
> presto, spark)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2021-08-12 Thread Udit Mehrotra (Jira)


[jira] [Updated] (HUDI-73) Support vanilla Avro Kafka Source in HoodieDeltaStreamer

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-73?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-73:
--
Fix Version/s: (was: 0.9.0)
   0.10.0

> Support vanilla Avro Kafka Source in HoodieDeltaStreamer
> 
>
> Key: HUDI-73
> URL: https://issues.apache.org/jira/browse/HUDI-73
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available, sev:high, user-support-issues
> Fix For: 0.10.0
>
>
> Context : [https://github.com/uber/hudi/issues/597]
> Currently, the Avro Kafka source expects the installation to use the Confluent 
> version with a Schema Registry server running. We need to support Kafka 
> installations that do not use Schema Registry by allowing 
> FileBasedSchemaProvider to be integrated with AvroKafkaSource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-304) Bring back spotless plugin

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-304:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> Bring back spotless plugin 
> ---
>
> Key: HUDI-304
> URL: https://issues.apache.org/jira/browse/HUDI-304
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup, Testing
>Reporter: Balaji Varadarajan
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The spotless plugin has been turned off, as the Eclipse style format it was 
> referencing was removed for compliance reasons. 
> We use the Google-style Eclipse format with some changes:
> 90c90
> < 
> ---
> > 
> 242c242
> <  value="100"/>
> ---
> >  > value="120"/>
>  
> The Eclipse style sheet was originally obtained from 
> [https://github.com/google/styleguide], which has a CC-BY 3.0 license that is not 
> compatible with source distribution (see 
> [https://www.apache.org/legal/resolved.html#cc-by]). 
>  
> We need to figure out a way to bring this back
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1038) Adding perf benchmark using jmh to Hudi

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1038:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Adding perf benchmark using jmh to Hudi
> ---
>
> Key: HUDI-1038
> URL: https://issues.apache.org/jira/browse/HUDI-1038
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Affects Versions: 0.9.0
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1120) Support spotless for scala

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1120:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Support spotless for scala
> --
>
> Key: HUDI-1120
> URL: https://issues.apache.org/jira/browse/HUDI-1120
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available, sev:normal, user-support-issues
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1063) Save in Google Cloud Storage not working

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1063:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Save in Google Cloud Storage not working
> 
>
> Key: HUDI-1063
> URL: https://issues.apache.org/jira/browse/HUDI-1063
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: David Lacalle Castillo
>Priority: Critical
>  Labels: sev:critical, sev:triage, user-support-issues
> Fix For: 0.10.0
>
>
> I added to spark submit the following properties: 
> {{--packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4
>  \  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'}}
> Spark version 2.4.5 and Hadoop version 3.2.1
>  
> I am trying to save a DataFrame to Google Cloud Storage as follows:
> tableName = "forecasts"
> basePath = "gs://hudi-datalake/" + tableName
> hudi_options = {
>  'hoodie.table.name': tableName,
>  'hoodie.datasource.write.recordkey.field': 'uuid',
>  'hoodie.datasource.write.partitionpath.field': 'partitionpath',
>  'hoodie.datasource.write.table.name': tableName,
>  'hoodie.datasource.write.operation': 'insert',
>  'hoodie.datasource.write.precombine.field': 'ts',
>  'hoodie.upsert.shuffle.parallelism': 2, 
>  'hoodie.insert.shuffle.parallelism': 2
> }
> results = results.selectExpr(
>  "ds as date",
>  "store",
>  "item",
>  "y as sales",
>  "yhat as sales_predicted",
>  "yhat_upper as sales_predicted_upper",
>  "yhat_lower as sales_predicted_lower",
>  "training_date")
> results.write.format("hudi"). \
>  options(**hudi_options). \
>  mode("overwrite"). \
>  save(basePath)
> I am getting the following error:
> Py4JJavaError: An error occurred while calling o312.save. : 
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V at 
> io.javalin.core.util.JettyServerUtil.defaultSessionHandler(JettyServerUtil.kt:50)
>  at io.javalin.Javalin.<init>(Javalin.java:94) at 
> io.javalin.Javalin.create(Javalin.java:107) at 
> org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:102)
>  at 
> org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74)
>  at 
> org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:102)
>  at 
> org.apache.hudi.client.AbstractHoodieClient.<init>(AbstractHoodieClient.java:69)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:83)
>  at 
> org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:137) 
> at 
> org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:124) 
> at 
> org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:120) 
> at 
> org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:195) 
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:135) 
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108) at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81) 
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>  at 
> org.apache.spark.sql.DataFra

[jira] [Updated] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1975:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Upgrade java-prometheus-client from 3.1.2 to 4.x
> 
>
> Key: HUDI-1975
> URL: https://issues.apache.org/jira/browse/HUDI-1975
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Nishith Agarwal
>Assignee: Vinay
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Find more details here -> https://github.com/apache/hudi/issues/2774



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1976) Upgrade hive, jackson, log4j, hadoop to remove vulnerability

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1976:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Upgrade hive, jackson, log4j, hadoop to remove vulnerability
> 
>
> Key: HUDI-1976
> URL: https://issues.apache.org/jira/browse/HUDI-1976
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: Vinay
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> [https://github.com/apache/hudi/issues/2827]
> [https://github.com/apache/hudi/issues/2826]
> [https://github.com/apache/hudi/issues/2824|https://github.com/apache/hudi/issues/2826]
> [https://github.com/apache/hudi/issues/2823|https://github.com/apache/hudi/issues/2826]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1856) Upstream changes made in PrestoDB to eliminate file listing to Trino

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1856:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Upstream changes made in PrestoDB to eliminate file listing to Trino
> 
>
> Key: HUDI-1856
> URL: https://issues.apache.org/jira/browse/HUDI-1856
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: sev:high, sev:triage
> Fix For: 0.10.0
>
>
> inputFormat.getSplits() code was optimized for PrestoDB code base. This 
> change is not implemented / upstreamed in Trino.
>  
> Additionally, there are other changes that need to be upstreamed in Trino. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1401) Presto use of Metadata Table for file listings

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1401:

Priority: Major  (was: Blocker)

> Presto use of Metadata Table for file listings
> --
>
> Key: HUDI-1401
> URL: https://issues.apache.org/jira/browse/HUDI-1401
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Vinoth Chandar
>Assignee: Udit Mehrotra
>Priority: Major
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1401) Presto use of Metadata Table for file listings

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1401:

Fix Version/s: (was: 0.9.0)
   0.7.0

> Presto use of Metadata Table for file listings
> --
>
> Key: HUDI-1401
> URL: https://issues.apache.org/jira/browse/HUDI-1401
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Vinoth Chandar
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1363) Provide Option to drop columns after they are used to generate partition or record keys

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1363:

Labels: pull-request-available release-blocker  (was: 
pull-request-available)

> Provide Option to drop columns after they are used to generate partition or 
> record keys
> ---
>
> Key: HUDI-1363
> URL: https://issues.apache.org/jira/browse/HUDI-1363
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available, release-blocker
> Fix For: 0.9.0
>
>
> Context: https://github.com/apache/hudi/issues/2213



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1353) Incremental timeline support for pending clustering operations

2021-08-12 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra updated HUDI-1353:

Fix Version/s: (was: 0.9.0)
   0.10.0

> Incremental timeline support for pending clustering operations
> --
>
> Key: HUDI-1353
> URL: https://issues.apache.org/jira/browse/HUDI-1353
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

