[GitHub] [hudi] soma17dec opened a new issue #4729: [SUPPORT] During compaction, can I merge only modified columns in a record and leave others unchanged

2022-01-31 Thread GitBox


soma17dec opened a new issue #4729:
URL: https://github.com/apache/hudi/issues/4729


   Hi,
   
   We are in the process of building a lakehouse using AWS services and Apache 
Hudi. We extract data from an Oracle DB using AWS DMS and push the files to S3 
object storage. Since AWS DMS does both full load and CDC replication, we have 
created two different tasks and are moving ahead with the data loads to S3. With 
the CDC task, delta files are generated with only the primary key and the 
modified columns, leaving all other columns as NULLs. Unfortunately, we cannot 
enable supplemental logging on all columns for our tables, as it increases 
overhead and has a performance impact. 
   
   We are building Hudi tables after moving data as parquet files to S3 and 
running upserts in MOR mode. 
   
   We want to understand whether Hudi has the capability to update the old full 
record (with all columns) using a new version that has only the PK column and 
the modified columns. 
   
   E.g.:
   
   Full Record - 101, Rahul, Manager, Engineering, 23-Apr-2020, $5, Y
   Delta Record - 101, , Sr Manager,,24-Apr-2022,,
   
   When compaction happens, the Hudi table returns
   
   101,,Sr Manager,,24-Apr-2022,,
   
   
   Expected Value - 101,Rahul, Sr Manager, Engineering, 24-Apr-2022, $5,Y
   
   Please advise if there is a solution for this problem.
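   
   For what it's worth, Hudi decides how two versions of a record are merged 
through a pluggable payload class (`hoodie.datasource.write.payload.class`), and 
`org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload` 
implements closely related semantics. Below is a minimal sketch of a custom 
payload that keeps the stored value for every column the CDC delta leaves as 
NULL; the class name and merge rule are illustrative, not a stock Hudi feature:
   
```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

/**
 * Illustrative sketch (not stock Hudi): for every column the incoming delta
 * record leaves as NULL, carry the previously stored value forward when the
 * two versions are merged (e.g., during MOR compaction).
 */
public class PartialUpdateAvroPayload extends OverwriteWithLatestAvroPayload {

  public PartialUpdateAvroPayload(GenericRecord record, Comparable orderingVal) {
    super(record, orderingVal);
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException {
    Option<IndexedRecord> incomingOpt = getInsertValue(schema);
    if (!incomingOpt.isPresent()) {
      return Option.empty(); // treated as a delete
    }
    GenericRecord incoming = (GenericRecord) incomingOpt.get();
    GenericRecord stored = (GenericRecord) currentValue;
    for (Schema.Field field : schema.getFields()) {
      // NULL in the delta means "unchanged" for this CDC feed, so keep the old value.
      if (incoming.get(field.pos()) == null) {
        incoming.put(field.pos(), stored.get(field.pos()));
      }
    }
    return Option.of(incoming);
  }
}
```
   
   Note that the merge only sees the latest stored version and the incoming 
delta, so the precombine/ordering field still has to advance with every delta 
for compaction to return the fully merged row.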
   
   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4662: [HUDI-3293] Fixing default value for clustering small file config

2022-01-31 Thread GitBox


hudi-bot removed a comment on pull request #4662:
URL: https://github.com/apache/hudi/pull/4662#issuecomment-1026484689


   
   ## CI report:
   
   * 789ecb457d2f5424674d512dd62d64480edc8c36 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5408)
 
   * 029895bca3e62168b02dec447357f48714ce43a2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5645)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4662: [HUDI-3293] Fixing default value for clustering small file config

2022-01-31 Thread GitBox


hudi-bot commented on pull request #4662:
URL: https://github.com/apache/hudi/pull/4662#issuecomment-1026523204


   
   ## CI report:
   
   * 029895bca3e62168b02dec447357f48714ce43a2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5645)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1180) Upgrade HBase to 2.x

2022-01-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-1180:

Reviewers: Vinoth Chandar  (was: Vinoth Chandar)

> Upgrade HBase to 2.x
> 
>
> Key: HUDI-1180
> URL: https://issues.apache.org/jira/browse/HUDI-1180
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Trying to upgrade HBase to 2.3.3 but ran into several issues.
> According to the Hadoop version support matrix: 
> [http://hbase.apache.org/book.html#hadoop], also need to upgrade Hadoop to 
> 2.8.5+.
>  
> There are several API conflicts between HBase 2.2.3 and HBase 1.2.3, we need 
> to resolve this first. After resolving conflicts, I am able to compile it but 
> then I ran into a tricky jetty version issue during the testing:
> {code:java}
> [ERROR] TestHBaseIndex.testDelete()  Time elapsed: 4.705 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdate()  Time elapsed: 0.174 
> s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdateWithRollback()  Time 
> elapsed: 0.076 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSmallBatchSize()  Time elapsed: 0.122 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTagLocationAndDuplicateUpdate()  Time elapsed: 
> 0.16 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalGetsBatching()  Time elapsed: 1.771 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalPutsBatching()  Time elapsed: 0.082 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> 34206 [Thread-260] WARN  
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner  - DirectoryScanner: 
> shutdown has been called
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager  - 
> IncrementalBlockReportManager interrupted
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.DataNode  - Ending block pool service 
> for: Block pool BP-1058834949-10.0.0.2-1597189606506 (Datanode Uuid 
> cb7bd8aa-5d79-4955-b1ec-bdaf7f1b6431) service to localhost/127.0.0.1:55924
> 34246 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data1/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 34247 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data2/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 37192 [HBase-Metrics2-1] WARN  org.apache.hadoop.metrics2.impl.MetricsConfig  
> - Cannot locate configuration: tried 
> hadoop-metrics2-datanode.properties,hadoop-metrics2.properties
> 43904 
> [master/iad1-ws-cor-r12:0:becomeActiveMaster-SendThread(localhost:58768)] 
> WARN  org.apache.zookeeper.ClientCnxn  - Session 0x173dfeb0c8b0004 for server 
> null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Errors: 
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   

[jira] [Commented] (HUDI-1180) Upgrade HBase to 2.x

2022-01-31 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485055#comment-17485055
 ] 

Ethan Guo commented on HUDI-1180:
-

WIP PR here: https://github.com/apache/hudi/pull/4695

> Upgrade HBase to 2.x
> 
>
> Key: HUDI-1180
> URL: https://issues.apache.org/jira/browse/HUDI-1180
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0

[jira] [Updated] (HUDI-1180) Upgrade HBase to 2.x

2022-01-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-1180:

Reviewers: Vinoth Chandar

> Upgrade HBase to 2.x
> 
>
> Key: HUDI-1180
> URL: https://issues.apache.org/jira/browse/HUDI-1180
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0

[GitHub] [hudi] guyuqi commented on pull request #4617: HUDI-1657: build failed on AArch64, Fedora 33

2022-01-31 Thread GitBox


guyuqi commented on pull request #4617:
URL: https://github.com/apache/hudi/pull/4617#issuecomment-1026493615


   > @guyuqi : can you respond to @yihua 's clarification above.
   
   Sorry for the late reply.
   I'm on Chinese New Year vacation and have limited access to a PC. I'll update 
the PR at the end of this week. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-01-31 Thread GitBox


hudi-bot removed a comment on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1026438796


   
   ## CI report:
   
   * f729e0283fc4780c460bd1cd5c427d740be7366d Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5636)
 
   * 547e500947154e6e3e998cdade9fc2be2d5508ec Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5640)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-01-31 Thread GitBox


hudi-bot commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1026492494


   
   ## CI report:
   
   * 547e500947154e6e3e998cdade9fc2be2d5508ec Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5640)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2584) Unit tests for bloom filter index based out of metadata table.

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2584:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24  (was: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31)

> Unit tests for bloom filter index based out of metadata table. 
> ---
>
> Key: HUDI-2584
> URL: https://issues.apache.org/jira/browse/HUDI-2584
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.11.0
>
>
> Test the bloom filter index based out of the metadata table.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3181) Address test failures after enabling metadata index for bloom filters and column stats

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3181:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24  (was: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31)

> Address test failures after enabling metadata index for bloom filters and 
> column stats
> --
>
> Key: HUDI-3181
> URL: https://issues.apache.org/jira/browse/HUDI-3181
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-1492) Enhance DeltaWriteStat with block level metadata correctly for storage schemes that support appends

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-1492:
--
Sprint: Hudi-Sprint-Jan-31

> Enhance DeltaWriteStat with block level metadata correctly for storage 
> schemes that support appends
> ---
>
> Key: HUDI-1492
> URL: https://issues.apache.org/jira/browse/HUDI-1492
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Current implementation simply uses the
> {code:java}
> String pathWithPartition = hoodieWriteStat.getPath(); {code}
> to write the metadata table. This is problematic if the delta write was
> merely an append, and it can technically add duplicate files into the metadata
> table.
> (Not sure if this is a problem per se, but filing a Jira to track and either
> close/fix.)
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3203) Meta bloom index should use the bloom filter type property to construct back the bloom filter instant

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3203:
--
Sprint: Hudi-Sprint-Jan-31

> Meta bloom index should use the bloom filter type property to construct back 
> the bloom filter instant
> -
>
> Key: HUDI-3203
> URL: https://issues.apache.org/jira/browse/HUDI-3203
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3142) Metadata new Indices initialization during table creation

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3142:
--
Sprint: Hudi-Sprint-Jan-31

> Metadata new Indices initialization during table creation 
> --
>
> Key: HUDI-3142
> URL: https://issues.apache.org/jira/browse/HUDI-3142
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> When the metadata table is created for the first time, it checks whether index
> initialization is needed by comparing against the data table timeline. Today the
> initialization only takes care of the metadata files partition. We need to do
> similar initialization for all the new index partitions - bloom_filters,
> col_stats.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3356) Conversion of write stats to metadata index records should use HoodieData throughout

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3356:
--
Sprint: Hudi-Sprint-Jan-31

> Conversion of write stats to metadata index records should use HoodieData 
> throughout
> 
>
> Key: HUDI-3356
> URL: https://issues.apache.org/jira/browse/HUDI-3356
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> HoodieTableMetadataUtil convertMetadataToRecords() converts all write stats
> to metadata index records as a List of HoodieRecords before passing them on to
> the engine-specific commit() to prep records. This can OOM the driver. We need
> to use HoodieData throughout.
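
A hedged sketch of the shape of the fix, assuming the HoodieEngineContext#parallelize and HoodieData#flatMap APIs visible elsewhere in this thread; the per-stat converter toMetadataRecords() below is a hypothetical placeholder:

{code:java}
import java.util.List;

import org.apache.hudi.common.data.HoodieData;
import org.apache.hudi.common.engine.HoodieEngineContext;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.model.HoodieWriteStat;

// Hedged sketch, not the actual patch: keep the records distributed end to end
// instead of materializing a List<HoodieRecord> on the driver.
class ConvertWithHoodieData {
  static HoodieData<HoodieRecord> convert(HoodieEngineContext context,
                                          List<HoodieWriteStat> writeStats,
                                          int parallelism) {
    return context.parallelize(writeStats, parallelism)
        // Each write stat yields its index records without ever being
        // collected back to the driver.
        .flatMap(stat -> toMetadataRecords(stat).iterator());
  }

  // Hypothetical per-stat converter, standing in for the existing
  // convertMetadataTo*Records() logic.
  private static List<HoodieRecord> toMetadataRecords(HoodieWriteStat stat) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}
{code}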



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] yihua commented on pull request #4695: [WIP][DO NOT MERGE][CI Test Only] Remove hbase-server dependency, pull in HFile related classes, with deps resolution

2022-01-31 Thread GitBox


yihua commented on pull request #4695:
URL: https://github.com/apache/hudi/pull/4695#issuecomment-1026487129


   cc @vinothchandar 
   
   My approach pulls the HFile-format-relevant classes from the HBase repo at 
release 2.4.9 into the hudi repo's `hudi-io` module, with the package renamed 
from `org.apache.hadoop.hbase` to `org.apache.hudi.hbase`.  I trimmed some 
classes to limit the number of deps pulled in.  All the backward-compatibility 
logic of KeyValue.KVComparator (hbase1) vs CellComparator (hbase2) is pulled in 
as well so we can control that.  This way, any hudi logic using the HFile 
format will use the internal `org.apache.hudi.hbase` classes, while 
SparkHoodieHBaseIndex still uses the hbase lib with `org.apache.hadoop.hbase` 
classes (the two are independent).
   
   A few things to finalize:
   - I'm questioning whether we should flip the hbase version in the hudi repo 
at all: if the first WIP PR can unlock the HFile format for the metadata table, 
Presto, and Trino, there is no real need to upgrade the hbase version to 2.x, 
which could introduce compatibility issues for SparkHoodieHBaseIndex.  Anything 
I miss here?  wdyt?
   - Right now, protobuf is used to generate the proto classes, and I pulled in 
the .proto files and protobuf libs (hudi-io-proto module).  Should I just put 
the generated java classes inside the repo and get rid of the proto-related 
files altogether?  I could keep the hudi-io-proto module and make hudi-io 
include the generated code without depending on hudi-io-proto, so we can still 
evolve the protos in the future.
   - Regarding the new dependencies pulled in, I can further trim the list down 
if some can cause conflicts, e.g., `commons-lang3`, `protobuf`:
   ```
   <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client</artifactId>
     <scope>provided</scope>
   </dependency>
   <dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-hdfs</artifactId>
     <scope>provided</scope>
   </dependency>
   <dependency>
     <groupId>org.apache.hbase.thirdparty</groupId>
     <artifactId>hbase-shaded-protobuf</artifactId>
     <version>4.0.1</version>
   </dependency>
   <dependency>
     <groupId>org.apache.hbase.thirdparty</groupId>
     <artifactId>hbase-shaded-miscellaneous</artifactId>
     <version>4.0.1</version>
   </dependency>
   <dependency>
     <groupId>org.apache.hbase.thirdparty</groupId>
     <artifactId>hbase-shaded-gson</artifactId>
     <version>4.0.1</version>
   </dependency>
   <dependency>
     <groupId>org.apache.hbase.thirdparty</groupId>
     <artifactId>hbase-shaded-netty</artifactId>
     <version>4.0.1</version>
   </dependency>
   <dependency>
     <groupId>org.apache.htrace</groupId>
     <artifactId>htrace-core4</artifactId>
     <version>4.2.0-incubating</version>
   </dependency>
   <dependency>
     <groupId>org.apache.commons</groupId>
     <artifactId>commons-lang3</artifactId>
     <version>3.12.0</version>
     <scope>compile</scope>
   </dependency>
   <dependency>
     <groupId>org.apache.yetus</groupId>
     <artifactId>audience-annotations</artifactId>
     <version>0.13.0</version>
   </dependency>
   <dependency>
     <groupId>com.esotericsoftware</groupId>
     <artifactId>kryo-shaded</artifactId>
     <version>4.0.2</version>
   </dependency>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4662: [HUDI-3293] Fixing default value for clustering small file config

2022-01-31 Thread GitBox


hudi-bot commented on pull request #4662:
URL: https://github.com/apache/hudi/pull/4662#issuecomment-1026484689


   
   ## CI report:
   
   * 789ecb457d2f5424674d512dd62d64480edc8c36 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5408)
 
   * 029895bca3e62168b02dec447357f48714ce43a2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5645)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4662: [HUDI-3293] Fixing default value for clustering small file config

2022-01-31 Thread GitBox


hudi-bot removed a comment on pull request #4662:
URL: https://github.com/apache/hudi/pull/4662#issuecomment-1026483309


   
   ## CI report:
   
   * 789ecb457d2f5424674d512dd62d64480edc8c36 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5408)
 
   * 029895bca3e62168b02dec447357f48714ce43a2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #4617: HUDI-1657: build failed on AArch64, Fedora 33

2022-01-31 Thread GitBox


nsivabalan commented on pull request #4617:
URL: https://github.com/apache/hudi/pull/4617#issuecomment-1026484269


   @guyuqi : can you respond to @yihua 's clarification above. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4662: [HUDI-3293] Fixing default value for clustering small file config

2022-01-31 Thread GitBox


hudi-bot commented on pull request #4662:
URL: https://github.com/apache/hudi/pull/4662#issuecomment-1026483309


   
   ## CI report:
   
   * 789ecb457d2f5424674d512dd62d64480edc8c36 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5408)
 
   * 029895bca3e62168b02dec447357f48714ce43a2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4662: [HUDI-3293] Fixing default value for clustering small file config

2022-01-31 Thread GitBox


hudi-bot removed a comment on pull request #4662:
URL: https://github.com/apache/hudi/pull/4662#issuecomment-1018317329


   
   ## CI report:
   
   * 789ecb457d2f5424674d512dd62d64480edc8c36 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5408)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua closed pull request #4684: [WIP][DO NOT MERGE][CI Test Only] Remove hbase-server dependency and pull in HFile related classes

2022-01-31 Thread GitBox


yihua closed pull request #4684:
URL: https://github.com/apache/hudi/pull/4684


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2458) Relax compaction in metadata being fenced based on inflight requests in data table

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2458:
--
Sprint: Hudi-Sprint-Jan-24  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31)

> Relax compaction in metadata being fenced based on inflight requests in data 
> table
> --
>
> Key: HUDI-2458
> URL: https://issues.apache.org/jira/browse/HUDI-2458
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Relax compaction in metadata being fenced based on inflight requests in data 
> table.
> Compaction in metadata is triggered only if there are no inflight requests in
> the data table. This might cause a liveness problem since, for very large
> deployments, we could have either compaction or clustering always in
> progress. So, we should try to see how we can relax this constraint.
>  
> Proposal to remove this dependency:
> With the recent addition of the spurious deletes config, we can actually get
> away with this.
> As of now, we have 3 interlinked nuances:
>  - Compaction in metadata may not kick in if there are any inflight
> operations in the data table.
>  - Rollback, when being applied to the metadata table, has a dependency on the
> last compaction instant in the metadata table. We might even throw an exception
> if the instant being rolled back is < the latest metadata compaction instant time.
>  - Archival in the data table is fenced by the latest compaction in the metadata
> table.
>  
> So, just in case the data timeline has any dangling inflight operation (let's
> say someone tried clustering, killed it midway, and never attempted it again),
> metadata compaction will never kick in at all. I still need to check what
> archival does for such inflight operations in the data table when it tries to
> archive nearby commits.
>  
> So, with the spurious deletes support we added recently, all of this can be
> much simplified.
> Whenever we want to apply a rollback commit, we don't need to take different
> actions based on whether the commit being rolled back is already committed to
> the metadata table or not. Just go ahead and apply the rollback; merging of the
> metadata payload records will take care of this. If the commit was already
> synced, the final merged payload will not have spurious deletes. If the commit
> being rolled back was never committed to metadata, the final merged payload may
> have some spurious deletes, which we can ignore.
> With this, compaction in metadata does not need any dependency on inflight
> operations in the data table.
> And we can loosen up the dependency of archival in the data table on metadata
> table compaction as well.
> So, in summary, all 3 dependencies quoted above become moot if we go with this
> approach: archival in the data table does not depend on metadata table
> compaction; rollback, when applied to the metadata table, does not care about
> the last metadata table compaction; and compaction in the metadata table can
> proceed even if there are inflight operations in the data table.
>  
> Especially, our logic to apply rollback metadata to the metadata table will
> become a lot simpler and easier to reason about.
>  
>  
>  
>  
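
To make the "spurious deletes which we can ignore" argument concrete, here is a toy model of the merge step (plain Java sets, deliberately not the actual HoodieMetadataPayload API):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Toy model only: the metadata files partition tracks a set of files per
// partition. A rollback that deletes a file the older payload never recorded
// (a spurious delete) falls out as a harmless no-op under set subtraction.
final class FilesPartitionMergeSketch {
  static Set<String> merge(Set<String> existing, Set<String> added, Set<String> deleted) {
    Set<String> merged = new HashSet<>(existing);
    merged.addAll(added);
    merged.removeAll(deleted); // spurious entries simply are not there to remove
    return merged;
  }
}
{code}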



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] manojpec commented on a change in pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-31 Thread GitBox


manojpec commented on a change in pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#discussion_r796262469



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##
@@ -79,14 +98,53 @@ public static void deleteMetadataTable(String basePath, HoodieEngineContext cont
     }
   }
 
+  /**
+   * Convert commit action to metadata records for the enabled partition types.
+   *
+   * @param commitMetadata                      - Commit action metadata
+   * @param dataMetaClient                      - Meta client for the data table
+   * @param isMetaIndexColumnStatsForAllColumns - Do all columns need meta indexing?
+   * @param instantTime                         - Action instant time
+   * @return Map of partition to metadata records for the commit action
+   */
+  public static Map<MetadataPartitionType, HoodieData<HoodieRecord>> convertMetadataToRecords(
+      HoodieEngineContext context, List<MetadataPartitionType> enabledPartitionTypes,
+      HoodieCommitMetadata commitMetadata, HoodieTableMetaClient dataMetaClient,
+      boolean isMetaIndexColumnStatsForAllColumns, String instantTime) {
+    final Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
+    final HoodieData<HoodieRecord> filesPartitionRecordsRDD = context.parallelize(
+        convertMetadataToFilesPartitionRecords(commitMetadata, instantTime), 1);
+    partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecordsRDD);
+
+    if (enabledPartitionTypes.contains(MetadataPartitionType.BLOOM_FILTERS)) {
+      final List<HoodieRecord> metadataBloomFilterRecords = convertMetadataToBloomFilterRecords(commitMetadata,

Review comment:
   HUDI-3356 is addressing this issue.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##
@@ -79,14 +98,53 @@ public static void deleteMetadataTable(String basePath, HoodieEngineContext cont
     }
   }
 
+  /**
+   * Convert commit action to metadata records for the enabled partition types.
+   *
+   * @param commitMetadata                      - Commit action metadata
+   * @param dataMetaClient                      - Meta client for the data table
+   * @param isMetaIndexColumnStatsForAllColumns - Do all columns need meta indexing?
+   * @param instantTime                         - Action instant time
+   * @return Map of partition to metadata records for the commit action
+   */
+  public static Map<MetadataPartitionType, HoodieData<HoodieRecord>> convertMetadataToRecords(
+      HoodieEngineContext context, List<MetadataPartitionType> enabledPartitionTypes,
+      HoodieCommitMetadata commitMetadata, HoodieTableMetaClient dataMetaClient,
+      boolean isMetaIndexColumnStatsForAllColumns, String instantTime) {
+    final Map<MetadataPartitionType, HoodieData<HoodieRecord>> partitionToRecordsMap = new HashMap<>();
+    final HoodieData<HoodieRecord> filesPartitionRecordsRDD = context.parallelize(
+        convertMetadataToFilesPartitionRecords(commitMetadata, instantTime), 1);
+    partitionToRecordsMap.put(MetadataPartitionType.FILES, filesPartitionRecordsRDD);
+
+    if (enabledPartitionTypes.contains(MetadataPartitionType.BLOOM_FILTERS)) {
+      final List<HoodieRecord> metadataBloomFilterRecords = convertMetadataToBloomFilterRecords(commitMetadata,
+          dataMetaClient, instantTime);
+      if (!metadataBloomFilterRecords.isEmpty()) {
+        final HoodieData<HoodieRecord> metadataBloomFilterRecordsRDD = context.parallelize(metadataBloomFilterRecords, 1);
+        partitionToRecordsMap.put(MetadataPartitionType.BLOOM_FILTERS, metadataBloomFilterRecordsRDD);
+      }
+    }
+
+    if (enabledPartitionTypes.contains(MetadataPartitionType.COLUMN_STATS)) {
+      final List<HoodieRecord> metadataColumnStats = convertMetadataToColumnStatsRecords(commitMetadata, context,

Review comment:
   HUDI-3356 is addressing this issue.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] manojpec commented on a change in pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-31 Thread GitBox


manojpec commented on a change in pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#discussion_r796262145



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##
@@ -124,14 +182,111 @@ public static void deleteMetadataTable(String basePath, HoodieEngineContext cont
     return records;
   }
 
+  /**
+   * Convert commit action metadata to bloom filter records.
+   *
+   * @param commitMetadata - Commit action metadata
+   * @param dataMetaClient - Meta client for the data table
+   * @param instantTime    - Action instant time
+   * @return List of metadata table records
+   */
+  public static List<HoodieRecord> convertMetadataToBloomFilterRecords(HoodieCommitMetadata commitMetadata,
+                                                                       HoodieTableMetaClient dataMetaClient,
+                                                                       String instantTime) {
+    List<HoodieRecord> records = new LinkedList<>();
+    commitMetadata.getPartitionToWriteStats().forEach((partitionStatName, writeStats) -> {

Review comment:
   Yes, if there are only log files, then this method will return an empty 
list, and that is handled at the caller.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] manojpec commented on a change in pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-31 Thread GitBox


manojpec commented on a change in pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#discussion_r796261359



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
##
@@ -319,9 +616,88 @@ private static void processRollbackMetadata(HoodieActiveTimeline metadataTableTi
     return records;
   }
 
+  /**
+   * Convert rollback action metadata to bloom filter index records.
+   */
+  private static List<HoodieRecord> convertFilesToBloomFilterRecords(HoodieEngineContext engineContext,
+                                                                     HoodieTableMetaClient dataMetaClient,
+                                                                     Map<String, List<String>> partitionToDeletedFiles,
+                                                                     Map<String, Map<String, Long>> partitionToAppendedFiles,
+                                                                     String instantTime) {
+    List<HoodieRecord> records = new LinkedList<>();
+    partitionToDeletedFiles.forEach((partitionName, deletedFileList) -> deletedFileList.forEach(deletedFile -> {
+      if (!FSUtils.isBaseFile(new Path(deletedFile))) {
+        return;
+      }
+
+      final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
+      records.add(HoodieMetadataPayload.createBloomFilterMetadataRecord(
+          partition, deletedFile, instantTime, ByteBuffer.allocate(0), true));
+    }));
+
+    partitionToAppendedFiles.forEach((partitionName, appendedFileMap) -> {
+      final String partition = partitionName.equals(EMPTY_PARTITION_NAME) ? NON_PARTITIONED_NAME : partitionName;
+      appendedFileMap.forEach((appendedFile, length) -> {
+        if (!FSUtils.isBaseFile(new Path(appendedFile))) {
+          return;
+        }
+        final String pathWithPartition = partitionName + "/" + appendedFile;
+        final Path appendedFilePath = new Path(dataMetaClient.getBasePath(), pathWithPartition);
+        try {
+          HoodieFileReader fileReader =
+              HoodieFileReaderFactory.getFileReader(dataMetaClient.getHadoopConf(), appendedFilePath);
+          final BloomFilter fileBloomFilter = fileReader.readBloomFilter();
+          if (fileBloomFilter == null) {
+            LOG.error("Failed to read bloom filter for " + appendedFilePath);
+            return;
+          }
+          ByteBuffer bloomByteBuffer = ByteBuffer.wrap(fileBloomFilter.serializeToString().getBytes());
+          HoodieRecord record = HoodieMetadataPayload.createBloomFilterMetadataRecord(
+              partition, appendedFile, instantTime, bloomByteBuffer, false);
+          records.add(record);
+          fileReader.close();
+        } catch (IOException e) {
+          LOG.error("Failed to get bloom filter for file: " + appendedFilePath);
+        }
+      });
+    });
+    return records;
+  }
+
+  /**
+   * Convert rollback action metadata to column stats index records.
+   */
+  private static List<HoodieRecord> convertFilesToColumnStatsRecords(HoodieEngineContext engineContext,
+                                                                     HoodieTableMetaClient datasetMetaClient,
+                                                                     Map<String, List<String>> partitionToDeletedFiles,
+                                                                     Map<String, Map<String, Long>> partitionToAppendedFiles,
+                                                                     String instantTime) {
+    List<HoodieRecord> records = new LinkedList<>();
+    List<String> latestColumns = getLatestColumns(datasetMetaClient);
+    partitionToDeletedFiles.forEach((partitionName, deletedFileList) -> deletedFileList.forEach(deletedFile -> {

Review comment:
   Right, HUDI-3356 will address this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #3989: [HUDI-2589] RFC-37: Metadata table based bloom index

2022-01-31 Thread GitBox


hudi-bot commented on pull request #3989:
URL: https://github.com/apache/hudi/pull/3989#issuecomment-1026475329


   
   ## CI report:
   
   * c57ad0d546e2150e450b361ce228ba80cb29db69 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5639)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #3989: [HUDI-2589] RFC-37: Metadata table based bloom index

2022-01-31 Thread GitBox


hudi-bot removed a comment on pull request #3989:
URL: https://github.com/apache/hudi/pull/3989#issuecomment-1026424661


   
   ## CI report:
   
   * 2686dc233dd86c814f1dd0ddcf0f9c0edd459af5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4603)
 
   * c57ad0d546e2150e450b361ce228ba80cb29db69 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5639)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] manojpec commented on a change in pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-31 Thread GitBox


manojpec commented on a change in pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#discussion_r796260578



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieColumnRangeMetadata.java
##
@@ -30,16 +28,21 @@
   private final String columnName;
   private final T minValue;
   private final T maxValue;
-  private final long numNulls;
-  private final PrimitiveStringifier stringifier;
+  private final long nullCount;
+  private final long valueCount;
+  private final long totalSize;
+  private final long totalUncompressedSize;
 
-  public HoodieColumnRangeMetadata(final String filePath, final String columnName, final T minValue, final T maxValue, final long numNulls, final PrimitiveStringifier stringifier) {
+  public HoodieColumnRangeMetadata(final String filePath, final String columnName, final T minValue, final T maxValue,

Review comment:
   HoodieColumnRangeMetadata  is an existing class. ParquetUtils and 
ColumnStatsIndexHelper use them widely. Don't prefer to rename this class in 
the scope of this PR.








[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-31 Thread GitBox


hudi-bot removed a comment on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1026453352


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * d6a1b6f1096fbdb00342c36ab5f241d3633981d6 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5642)
 
   * 3784c4bf415fec6e48f1438c2f14eb4061c608cf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-31 Thread GitBox


hudi-bot commented on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1026473489


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * d6a1b6f1096fbdb00342c36ab5f241d3633981d6 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5642)
 
   * 3784c4bf415fec6e48f1438c2f14eb4061c608cf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5644)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[jira] [Updated] (HUDI-3298) Explore and Try Byteman for error injection on Hudi write flow

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3298:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24  (was: Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31)

> Explore and Try Byteman for error injection on Hudi write flow
> --
>
> Key: HUDI-3298
> URL: https://issues.apache.org/jira/browse/HUDI-3298
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3298) Explore and Try Byteman for error injection on Hudi write flow

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3298:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Explore and Try Byteman for error injection on Hudi write flow
> --
>
> Key: HUDI-3298
> URL: https://issues.apache.org/jira/browse/HUDI-3298
> Project: Apache Hudi
>  Issue Type: Task
>  Components: tests-ci
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3284) Restore hudi-presto-bundle changes and upgrade presto version in docker setup

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3284:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Restore hudi-presto-bundle changes and upgrade presto version in docker setup
> -
>
> Key: HUDI-3284
> URL: https://issues.apache.org/jira/browse/HUDI-3284
> Project: Apache Hudi
>  Issue Type: Task
>  Components: trino-presto
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> For more details, see https://github.com/apache/hudi/pull/4646





[jira] [Updated] (HUDI-2458) Relax compaction in metadata being fenced based on inflight requests in data table

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2458:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Relax compaction in metadata being fenced based on inflight requests in data 
> table
> --
>
> Key: HUDI-2458
> URL: https://issues.apache.org/jira/browse/HUDI-2458
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Relax the fencing of metadata-table compaction on inflight requests in the
> data table.
> Compaction in the metadata table is triggered only if there are no inflight
> requests in the data table. This might cause a liveness problem, since for
> very large deployments we could have either compaction or clustering always in
> progress. So, we should try to see how we can relax this constraint.
>  
> Proposal to remove this dependency:
> With the recent addition of the spurious-deletes config, we can actually get
> away with this.
> As of now, we have 3 interlinked nuances:
>  - Compaction in the metadata table may not kick in if there are any inflight
> operations in the data table.
>  - Rollback, when being applied to the metadata table, has a dependency on
> the last compaction instant in the metadata table. We might even throw an
> exception if the instant being rolled back is < the latest metadata compaction
> instant time.
>  - Archival in the data table is fenced by the latest compaction in the
> metadata table.
>  
> So, if the data timeline has any dangling inflight operation (let's say
> someone tried clustering, killed it midway, and never attempted it again),
> metadata compaction will never kick in at all, for good. I need to check what
> archival does for such inflight operations in the data table, though, when it
> tries to archive nearby commits.
>  
> So, with the spurious-deletes support we added recently, all of this can be
> much simplified.
> Whenever we want to apply a rollback commit, we don't need to take different
> actions based on whether the commit being rolled back has already been
> committed to the metadata table or not. Just go ahead and apply the rollback;
> merging of metadata payload records will take care of this. If the commit was
> already synced, the final merged payload may not have spurious deletes. If the
> commit being rolled back was never committed to metadata, the final merged
> payload may have some spurious deletes, which we can ignore (a toy sketch of
> this merge-time handling follows below).
> With this, compaction in the metadata table does not need to have any
> dependency on inflight operations in the data table.
> And we can loosen up the dependency of archival in the data table on
> metadata-table compaction as well.
> So, in summary, all 3 dependencies quoted above become moot if we go with this
> approach. Archival in the data table does not have any dependency on
> metadata-table compaction. Rollback, when being applied to the metadata table,
> does not care about the last metadata-table compaction. Compaction in the
> metadata table can proceed even if there are inflight operations in the data
> table.
>  
> Our logic to apply rollback metadata to the metadata table, especially, will
> become a lot simpler and easier to reason about.
>  
>  
>  
>  
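> A toy sketch of that merge-time handling (hypothetical structures; not
> HoodieMetadataPayload's actual implementation):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Set;
>
> public class SpuriousDeleteMergeSketch {
>   // Merge an older files-partition view with a newer one, then apply deletes.
>   static Map<String, Long> merge(Map<String, Long> older, Map<String, Long> newer,
>                                  Set<String> deletedFiles) {
>     Map<String, Long> merged = new HashMap<>(older);
>     merged.putAll(newer);
>     for (String file : deletedFiles) {
>       // A delete for a file the metadata table never saw is "spurious";
>       // remove() is a no-op then, so ignoring it is safe -- which is what
>       // lets metadata compaction proceed without fencing on inflight ops.
>       merged.remove(file);
>     }
>     return merged;
>   }
> }
> {code}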





[jira] [Updated] (HUDI-3337) ParquetUtils fails extracting Parquet Column Range Metadata

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3337:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> ParquetUtils fails extracting Parquet Column Range Metadata
> ---
>
> Key: HUDI-3337
> URL: https://issues.apache.org/jira/browse/HUDI-3337
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> [~manojpec] discovered the following issue while testing MT flows, with 
> {{TestHoodieBackedMetadata#testTableOperationsWithMetadataIndex}} failing 
> with:
>  
> {code:java}
> 17400 [Executor task launch worker for task 240] ERROR 
> org.apache.hudi.metadata.HoodieTableMetadataUtil  - Failed to read column 
> stats for 
> /var/folders/t7/kr69rlvx5rdd824m61zjqkjrgn/T/junit2402861080324269156/dataset/2016/03/15/44396fda-48db-4d10-9f47-275c39317115-0_0-101-234_003.parquet
> java.lang.ClassCastException: 
> org.apache.parquet.io.api.Binary$ByteArrayBackedBinary cannot be cast to 
> java.lang.Integer
>   at 
> org.apache.hudi.common.util.ParquetUtils.convertToNativeJavaType(ParquetUtils.java:369)
>   at 
> org.apache.hudi.common.util.ParquetUtils.lambda$null$2(ParquetUtils.java:305)
>   at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at 
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
>   at 
> java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
>   at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>   at 
> org.apache.hudi.common.util.ParquetUtils.readRangeFromParquetMetadata(ParquetUtils.java:313)
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.getColumnStats(HoodieTableMetadataUtil.java:878)
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.translateWriteStatToColumnStats(HoodieTableMetadataUtil.java:858)
>   at 
> org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$createColumnStatsFromWriteStats$7e2376a$1(HoodieTableMetadataUtil.java:819)
>   at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:134)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
>   at 
> 

[jira] [Updated] (HUDI-3318) Write RFC regarding proposed changes to the RecordPayload hierarchy

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3318:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Write RFC regarding proposed changes to the RecordPayload hierarchy
> ---
>
> Key: HUDI-3318
> URL: https://issues.apache.org/jira/browse/HUDI-3318
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3322) Rollback Plan for Delta Commits constructed incorrectly

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3322:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Rollback Plan for Delta Commits constructed incorrectly
> ---
>
> Key: HUDI-3322
> URL: https://issues.apache.org/jira/browse/HUDI-3322
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Diving deeper into the issue of HUDI-3279, I've realized that the root cause 
> of the problem is that the Rollback Plan for Delta Commits is composed 
> incorrectly for MOR tables. Consider the case below (we will continue to rely 
> on the test 
> {{{}TestHoodieSparkMergeOnReadTableRollback#testMORTableRestore{}}}):
> Hoodie Timeline:
> {code:java}
> alexey.kudinkin@alexeys-mbp junit5494198038159268501 % ls -la .hoodie
> total 400
> drwxr-xr-x  52 alexey.kudinkin  staff  1664 Jan 25 13:08 .
> drwx--   5 alexey.kudinkin  staff   160 Jan 25 12:56 ..
> -rw-r--r--   1 alexey.kudinkin  staff    48 Jan 25 12:56 .001.deltacommit.crc
> -rw-r--r--   1 alexey.kudinkin  staff    28 Jan 25 12:56 
> .001.deltacommit.inflight.crc
> -rw-r--r--   1 alexey.kudinkin  staff     8 Jan 25 12:56 
> .001.deltacommit.requested.crc
> -rw-r--r--   1 alexey.kudinkin  staff    52 Jan 25 12:56 .002.deltacommit.crc
> -rw-r--r--   1 alexey.kudinkin  staff    48 Jan 25 12:56 
> .002.deltacommit.inflight.crc
> -rw-r--r--   1 alexey.kudinkin  staff     8 Jan 25 12:56 
> .002.deltacommit.requested.crc
> -rw-r--r--   1 alexey.kudinkin  staff    56 Jan 25 12:57 .003.deltacommit.crc
> -rw-r--r--   1 alexey.kudinkin  staff    48 Jan 25 12:57 
> .003.deltacommit.inflight.crc
> -rw-r--r--   1 alexey.kudinkin  staff     8 Jan 25 12:56 
> .003.deltacommit.requested.crc
> -rw-r--r--   1 alexey.kudinkin  staff    56 Jan 25 12:57 .004.deltacommit.crc
> -rw-r--r--   1 alexey.kudinkin  staff    48 Jan 25 12:57 
> .004.deltacommit.inflight.crc
> -rw-r--r--   1 alexey.kudinkin  staff     8 Jan 25 12:57 
> .004.deltacommit.requested.crc
> -rw-r--r--   1 alexey.kudinkin  staff    48 Jan 25 12:57 .005.commit.crc
> -rw-r--r--   1 alexey.kudinkin  staff     8 Jan 25 12:57 
> .005.compaction.inflight.crc
> -rw-r--r--   1 alexey.kudinkin  staff    28 Jan 25 12:57 
> .005.compaction.requested.crc
> -rw-r--r--   1 alexey.kudinkin  staff    52 Jan 25 12:57 .006.deltacommit.crc
> -rw-r--r--   1 alexey.kudinkin  staff    48 Jan 25 12:57 
> .006.deltacommit.inflight.crc
> -rw-r--r--   1 alexey.kudinkin  staff     8 Jan 25 12:57 
> .006.deltacommit.requested.crc
> -rw-r--r--   1 alexey.kudinkin  staff    52 Jan 25 12:57 .007.deltacommit.crc
> -rw-r--r--   1 alexey.kudinkin  staff    48 Jan 25 12:57 
> .007.deltacommit.inflight.crc
> -rw-r--r--   1 alexey.kudinkin  staff     8 Jan 25 12:57 
> .007.deltacommit.requested.crc
> -rw-r--r--   1 alexey.kudinkin  staff     8 Jan 25 13:08 
> .20220125130818473.restore.inflight.crc
> drwxr-xr-x   5 alexey.kudinkin  staff   160 Jan 25 12:57 .aux
> -rw-r--r--   1 alexey.kudinkin  staff    12 Jan 25 12:56 
> .hoodie.properties.crc
> drwxr-xr-x   2 alexey.kudinkin  staff    64 Jan 25 12:57 .temp
> -rw-r--r--   1 alexey.kudinkin  staff  4822 Jan 25 12:56 001.deltacommit
> -rw-r--r--   1 alexey.kudinkin  staff  2499 Jan 25 12:56 
> 001.deltacommit.inflight
> -rw-r--r--   1 alexey.kudinkin  staff     0 Jan 25 12:56 
> 001.deltacommit.requested
> -rw-r--r--   1 alexey.kudinkin  staff  5451 Jan 25 12:56 002.deltacommit
> -rw-r--r--   1 alexey.kudinkin  staff  4620 Jan 25 12:56 
> 002.deltacommit.inflight
> -rw-r--r--   1 alexey.kudinkin  staff     0 Jan 25 12:56 
> 002.deltacommit.requested
> -rw-r--r--   1 alexey.kudinkin  staff  5646 Jan 25 12:57 003.deltacommit
> -rw-r--r--   1 alexey.kudinkin  staff  4620 Jan 25 12:57 
> 003.deltacommit.inflight
> -rw-r--r--   1 alexey.kudinkin  staff     0 Jan 25 12:56 
> 003.deltacommit.requested
> -rw-r--r--   1 alexey.kudinkin  staff  5835 Jan 25 12:57 004.deltacommit
> -rw-r--r--   1 alexey.kudinkin  staff  4620 Jan 25 12:57 
> 004.deltacommit.inflight
> -rw-r--r--   1 alexey.kudinkin  staff     0 Jan 25 12:57 
> 004.deltacommit.requested
> -rw-r--r--   1 alexey.kudinkin  staff  4756 Jan 25 12:57 005.commit
> -rw-r--r--   1 alexey.kudinkin  staff     0 Jan 25 12:57 
> 005.compaction.inflight
> -rw-r--r--   1 alexey.kudinkin  staff  2507 Jan 25 12:57 
> 005.compaction.requested
> -rw-r--r--   1 alexey.kudinkin  staff  5362 Jan 25 12:57 006.deltacommit
> -rw-r--r--   1 alexey.kudinkin  staff  4620 Jan 25 12:57 
> 006.deltacommit.inflight
> -rw-r--r--   1 alexey.kudinkin  staff     0 Jan 25 12:57 
> 

[jira] [Updated] (HUDI-3330) TestHoodieDeltaStreamerWithMultiWriter: Use HoodieTestDataGenerator to generate backfill dataset

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3330:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> TestHoodieDeltaStreamerWithMultiWriter: Use HoodieTestDataGenerator to 
> generate backfill dataset
> 
>
> Key: HUDI-3330
> URL: https://issues.apache.org/jira/browse/HUDI-3330
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Manoj Govindassamy
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> TestHoodieDeltaStreamerWithMultiWriter uses zip artifacts with a pre-generated 
> dataset for its backfill jobs. Any time the metadata table schema changes, the 
> records in the artifacts need to be regenerated. It would be better to 
> generate the dataset the standard way, using HoodieTestDataGenerator, for this 
> test.





[jira] [Updated] (HUDI-3177) Support CREATE INDEX statement

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3177:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Support CREATE INDEX statement
> --
>
> Key: HUDI-3177
> URL: https://issues.apache.org/jira/browse/HUDI-3177
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, metadata
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Users should be able to trigger index creation using a CREATE INDEX statement 
> for one or more partitions.





[jira] [Updated] (HUDI-3352) Rebase `HoodieWriteHandle` to use `HoodieRecord`

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3352:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Rebase `HoodieWriteHandle` to use `HoodieRecord`
> 
>
> Key: HUDI-3352
> URL: https://issues.apache.org/jira/browse/HUDI-3352
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> From RFC-46:
> `HoodieWriteHandle`s will be:
>    1. Accepting `HoodieRecord` instead of a raw Avro payload (avoiding Avro 
> conversion)
>    2. Using the Combining API of the engine to merge records (when necessary) 
>    3. Passing `HoodieRecord` as-is to `FileWriter`





[jira] [Updated] (HUDI-3341) Investigate that metadata table cannot be read for hadoop-aws 2.7.x

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3341:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Investigate that metadata table cannot be read for hadoop-aws 2.7.x
> ---
>
> Key: HUDI-3341
> URL: https://issues.apache.org/jira/browse/HUDI-3341
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3280) Clean up unused/deprecated methods

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3280:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Clean up unused/deprecated methods
> --
>
> Key: HUDI-3280
> URL: https://issues.apache.org/jira/browse/HUDI-3280
> Project: Apache Hudi
>  Issue Type: Task
>  Components: reader-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Clean up unused/deprecated methods as well as additional validations in 
>  * HoodieInputFormatUtils





[jira] [Updated] (HUDI-2584) Unit tests for bloom filter index based out of metadata table.

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2584:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Unit tests for bloom filter index based out of metadata table. 
> ---
>
> Key: HUDI-2584
> URL: https://issues.apache.org/jira/browse/HUDI-2584
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: release-blocker
> Fix For: 0.11.0
>
>
> Test the bloom filter index backed by the metadata table.
>  





[jira] [Updated] (HUDI-3042) Refactor clustering action in hudi-client module to use HoodieData abstraction

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3042:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Refactor clustering action in hudi-client module to use HoodieData abstraction
> --
>
> Key: HUDI-3042
> URL: https://issues.apache.org/jira/browse/HUDI-3042
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: sev:high
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-1127) Handling late arriving Deletes

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-1127:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Handling late arriving Deletes
> --
>
> Key: HUDI-1127
> URL: https://issues.apache.org/jira/browse/HUDI-1127
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, writer-core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: sev:high
> Fix For: 0.11.0
>
>
> Recently I was working on a [PR|https://github.com/apache/hudi/pull/1704] to 
> enhance the OverwriteWithLatestAvroPayload class to consider records in 
> storage when merging. Briefly, this class will ignore older updates if the 
> record in storage is the latest one (based on the precombine field).
> Based on this, the expectation is that any write operation is dealt with the 
> same way: if it is older, it should be ignored.
> While at this, I identified that we cannot handle all deletes the same way. 
> This is because we process deletes mainly in two ways -
>  * by adding and enabling a metadata field `_hoodie_is_deleted` in the 
> original record and sending it as an UPSERT operation.
>  * by using an empty payload via EmptyHoodieRecordPayload and sending the 
> write as a DELETE operation.
> While the former has an ordering field and can be processed as expected (older 
> deletes will be ignored), the latter does not have any ordering field to 
> identify whether it is an older delete or not, and hence will let the older 
> delete go through (see the sketch after this description).
> Just opening this issue to track this gap. We would need to identify what the 
> right choice is here and fix as needed.
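>  
> A minimal sketch of the two delete paths (field names illustrative; the
> HoodieRecord/EmptyHoodieRecordPayload constructors may differ across Hudi
> versions):
> {code:java}
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.hudi.common.model.EmptyHoodieRecordPayload;
> import org.apache.hudi.common.model.HoodieKey;
> import org.apache.hudi.common.model.HoodieRecord;
>
> public class DeletePathsSketch {
>   // `schema` is assumed to contain the uuid, ts and _hoodie_is_deleted fields.
>   static void illustrate(Schema schema) {
>     // Path 1: soft delete -- the record keeps its ordering/precombine field
>     // ("ts"), so OverwriteWithLatestAvroPayload-style merging can drop an
>     // older delete.
>     GenericRecord softDelete = new GenericData.Record(schema);
>     softDelete.put("uuid", "key-1");
>     softDelete.put("ts", 100L);                  // ordering field present
>     softDelete.put("_hoodie_is_deleted", true);  // marks the upsert as a delete
>
>     // Path 2: hard delete -- the empty payload carries no ordering field, so
>     // a late-arriving delete cannot be recognized as older than storage.
>     HoodieRecord<EmptyHoodieRecordPayload> hardDelete = new HoodieRecord<>(
>         new HoodieKey("key-1", "2022/01/31"), new EmptyHoodieRecordPayload());
>   }
> }
> {code}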





[jira] [Updated] (HUDI-3276) Make HoodieParquetInputFormat extend MapredParquetInputFormat again

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3276:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Make HoodieParquetInputFormat extend MapredParquetInputFormat again
> ---
>
> Key: HUDI-3276
> URL: https://issues.apache.org/jira/browse/HUDI-3276
> Project: Apache Hudi
>  Issue Type: Task
>  Components: hive, reader-core
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3239) Convert AbstractHoodieTableFileIndex to Java

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3239:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Convert AbstractHoodieTableFileIndex to Java
> 
>
> Key: HUDI-3239
> URL: https://issues.apache.org/jira/browse/HUDI-3239
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Since it was extracted from `HoodieFileIndex`, the path of least resistance 
> was taken to keep it in Scala for the time being.
>  
> This brings an unnecessary dependency on Scala into the Hive bundles. We 
> should convert it to Java.





[jira] [Updated] (HUDI-2732) Spark Datasource V2 integration RFC

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2732:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Spark Datasource V2 integration RFC 
> 
>
> Key: HUDI-2732
> URL: https://issues.apache.org/jira/browse/HUDI-2732
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3254) Introduce HoodieCatalog to manage tables for Spark Datasource V2

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3254:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Introduce HoodieCatalog to manage tables for Spark Datasource V2
> 
>
> Key: HUDI-3254
> URL: https://issues.apache.org/jira/browse/HUDI-3254
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
>  Labels: pull-request-available, sev:normal
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2930) Rollbacks are not archived when metadata table is enabled

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2930:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Rollbacks are not archived when metadata table is enabled
> -
>
> Key: HUDI-2930
> URL: https://issues.apache.org/jira/browse/HUDI-2930
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> I ran bulk inserts into a COW table using DeltaStreamer continuous mode and 
> observed that the rollbacks are not archived. There were commits in between 
> these old rollbacks, but after the archival process kicks in, the old 
> rollbacks remain in the active timeline while the other commits are archived.





[jira] [Updated] (HUDI-2973) Rewrite/re-publish RFC for Data skipping index

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2973:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Rewrite/re-publish RFC for Data skipping index
> --
>
> Key: HUDI-2973
> URL: https://issues.apache.org/jira/browse/HUDI-2973
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3207) Hudi Trino connector PR review

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3207:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Hudi Trino connector PR review
> --
>
> Key: HUDI-3207
> URL: https://issues.apache.org/jira/browse/HUDI-3207
> Project: Apache Hudi
>  Issue Type: Task
>  Components: trino-presto
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> https://github.com/trinodb/trino/pull/10228





[jira] [Updated] (HUDI-3074) Docs for Z-order

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3074:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Docs for Z-order
> 
>
> Key: HUDI-3074
> URL: https://issues.apache.org/jira/browse/HUDI-3074
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, docs
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-1370) Scoping work needed to support bootstrapped data table and RFC-15 together

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-1370:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Scoping work needed to support bootstrapped data table and RFC-15 together
> --
>
> Key: HUDI-1370
> URL: https://issues.apache.org/jira/browse/HUDI-1370
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2751) To avoid the duplicates for streaming read MOR table

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2751:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> To avoid the duplicates for streaming read MOR table
> 
>
> Key: HUDI-2751
> URL: https://issues.apache.org/jira/browse/HUDI-2751
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Reporter: Danny Chen
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Imagine there are commits on the timeline:
> {code:java}
>              inflight compaction                    complete compaction
>                      |                                      |
> -- instant 99 -- instant 100 ---- 101 ---- 102 -------- instant 100 --
>   first read ->|                         second read --------->|
> --- range 1 ---|--------------- range 2 -----------------------|
> {code}
> Instants 99, 101, and 102 are successful non-compaction delta commits;
> instant 100 is a compaction instant.
> The first incremental read consumes up to instant 99 and the second read 
> consumes from instant 100 to instant 102; the second read would consume the 
> commit files of instant 100, which have already been consumed before.
> The duplicate reading happens when this condition triggers: a compaction 
> instant is scheduled and then completes within *one* consume range.





[jira] [Updated] (HUDI-1180) Upgrade HBase to 2.x

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-1180:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Upgrade HBase to 2.x
> 
>
> Key: HUDI-1180
> URL: https://issues.apache.org/jira/browse/HUDI-1180
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Trying to upgrade HBase to 2.3.3, but I ran into several issues.
> According to the Hadoop version support matrix 
> [http://hbase.apache.org/book.html#hadoop], we also need to upgrade Hadoop to 
> 2.8.5+.
>  
> There are several API conflicts between HBase 2.2.3 and HBase 1.2.3; we need 
> to resolve these first. After resolving the conflicts, I was able to compile, 
> but then I ran into a tricky Jetty version issue during testing:
> {code:java}
> [ERROR] TestHBaseIndex.testDelete()  Time elapsed: 4.705 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdate()  Time elapsed: 0.174 
> s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdateWithRollback()  Time 
> elapsed: 0.076 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSmallBatchSize()  Time elapsed: 0.122 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTagLocationAndDuplicateUpdate()  Time elapsed: 
> 0.16 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalGetsBatching()  Time elapsed: 1.771 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalPutsBatching()  Time elapsed: 0.082 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> 34206 [Thread-260] WARN  
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner  - DirectoryScanner: 
> shutdown has been called
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager  - 
> IncrementalBlockReportManager interrupted
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.DataNode  - Ending block pool service 
> for: Block pool BP-1058834949-10.0.0.2-1597189606506 (Datanode Uuid 
> cb7bd8aa-5d79-4955-b1ec-bdaf7f1b6431) service to localhost/127.0.0.1:55924
> 34246 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data1/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 34247 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data2/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 37192 [HBase-Metrics2-1] WARN  org.apache.hadoop.metrics2.impl.MetricsConfig  
> - Cannot locate configuration: tried 
> hadoop-metrics2-datanode.properties,hadoop-metrics2.properties
> 43904 
> [master/iad1-ws-cor-r12:0:becomeActiveMaster-SendThread(localhost:58768)] 
> WARN  org.apache.zookeeper.ClientCnxn  - Session 0x173dfeb0c8b0004 for server 
> null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Errors: 
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   

[jira] [Updated] (HUDI-431) Support Parquet in MOR log files

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-431:
-
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Support Parquet in MOR log files
> 
>
> Key: HUDI-431
> URL: https://issues.apache.org/jira/browse/HUDI-431
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: storage-management
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: help-requested, pull-request-available
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> We have a basic implementation of an inline filesystem, to read a file format 
> like Parquet embedded "inline" into another file. See 
> [https://github.com/apache/hudi/blob/master/hudi-common/src/test/java/org/apache/hudi/common/fs/inline/TestInLineFileSystem.java]
> for sample usage (a rough read-path sketch also follows below).
> The idea here is to see if we can embed parquet/hfile formats into the Hudi 
> log files, to get columnar reads on the delta log files as well. This helps 
> us speed up query performance, given the log is row-based today. Once the 
> inline FS is available, enable parquet logging support with HoodieLogFile. 
> LogFile can expose a writer (essentially a ParquetWriter) so users can write 
> records as though writing to parquet files. Similarly, on the read path, a 
> reader (ParquetReader) will be exposed which the user can use to read data 
> out of it.
> This Jira tracks the work to implement such parquet inlining into the log 
> format and to have the writer and reader use it.
>  
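> A rough read-path sketch following the pattern in the referenced test (class
> and method names as used there; verify exact signatures against the current
> code):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hudi.common.fs.inline.InLineFSUtils;
>
> public class InlineReadSketch {
>   static void read(Configuration conf) throws Exception {
>     // Assumes the "inlinefs" scheme is registered to InLineFileSystem in conf.
>     Path outerPath = new Path("file:///tmp/outer-file");
>     long startOffset = 20L;
>     long inlineLength = 500L; // byte range of the embedded (inlined) file
>     Path inlinePath = InLineFSUtils.getInlineFilePath(outerPath, "file", startOffset, inlineLength);
>     FileSystem inlineFs = inlinePath.getFileSystem(conf);
>     try (FSDataInputStream in = inlineFs.open(inlinePath)) {
>       // Reads map to [startOffset, startOffset + inlineLength) of the outer
>       // file, so e.g. a ParquetReader can be pointed at inlinePath directly.
>       in.read();
>     }
>   }
> }
> {code}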





[jira] [Updated] (HUDI-2432) Fix restore by adding a requested instant and restore plan

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2432:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Fix restore by adding a requested instant and restore plan
> --
>
> Key: HUDI-2432
> URL: https://issues.apache.org/jira/browse/HUDI-2432
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Fix restore by adding a requested instant and restore plan.
>  
> Trying to see if we really need a plan; dumping my thoughts here.
> Restore internally converts to N rollbacks. We fetch active instants in 
> reverse order from the timeline and trigger rollbacks one by one. We already 
> have a patch fixing rollback to add a rollback plan in the rollback.requested 
> meta file. So, walking through the failure scenarios:
>  
> With restore, individual rollbacks are not published to the timeline. So, if 
> restore fails midway, on the 2nd attempt only a subset of the rollbacks 
> (those executed during the 2nd attempt) will be applied to the metadata 
> table. So we need a plan for restore as well.
> But with our enhancement to rollback to publish a plan, rollback.requested 
> can't be skipped and we have to publish it to the timeline. So, here is what 
> will happen w/o a restore plan:
>  
> start restore
>     rollback commit N
>           rollback.requested for commit N // plan
>           execute rollback, but do not publish to timeline; so this will not 
> get applied to the metadata table.
>     rollback commit N-1
>            rollback.requested for commit N-1 // plan
>           execute rollback, but do not publish to timeline; again, will not 
> get applied to the metadata table.
>      .
> commit restore and publish. This will get applied to the metadata table.
> Once we are done committing the restore, we can remove all rollback.requested 
> files if needed.
>  
> Failure scenarios:
> Say we fail after 2 rollbacks.
> On re-attempt, we would process the remaining commits only, since the active 
> timeline may not report commit N and commit N-1 as active. So, we can do 
> something like below w/ a restore plan (see the sketch after this 
> description):
>  
> 1. start restore
> 2. schedule rollback for all of them: serialize all commit instants that need 
> to be rolled back along with the rollback plan. // by now, we would have 
> created the rollback.requested meta file for all commits that need to be 
> rolled back.
> 3. now execute the rollbacks one by one. // do not publish to the timeline 
> once done; also, changes should not be applied to the metadata table.
> 4. collect the rollback commit metadata from all individual rollbacks and 
> create the restore commit metadata. There could be some commits which were 
> already rolled back, and for those we need to manually create rollback 
> metadata based on the rollback plan (more details in the next paragraph). 
> Commit the restore and publish; only this will get applied to the metadata 
> table (which in turn will unwrap the individual rollback metadata and apply 
> it to the metadata table).
>  
> Failures:
> If we fail after the 2nd rollback: on the 2nd attempt, we will look at the 
> restore plan for all commits that need to be rolled back. We can't really 
> roll back the first 2 since they are already rolled back, so we will manually 
> create rollback metadata from the rollback.requested meta file; for the rest, 
> we will follow the regular flow of executing the actual rollback and 
> collecting rollback metadata. Once complete, we will serialize all this info 
> in the restore metadata, which gets applied to the metadata table.
>  
> Alternatives: since restore is a destructive operation anyway and users are 
> advised to stop all processes, we do have the option to clean up the metadata 
> table and re-bootstrap it completely once restore is complete.
>  
>  
>  
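> A rough Java-flavored sketch of the re-attempt-safe flow above (every type
> and method here is hypothetical; it only mirrors steps 1-4):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
>
> abstract class RestoreFlowSketch<INSTANT, ROLLBACK_META> {
>   abstract List<INSTANT> scheduleRestore(List<INSTANT> toRollback); // step 2: persist plan + rollback.requested files
>   abstract boolean alreadyRolledBack(INSTANT instant);
>   abstract ROLLBACK_META rollbackMetadataFromPlan(INSTANT instant);
>   abstract ROLLBACK_META executeRollback(INSTANT instant);          // step 3: not published to the timeline
>   abstract void commitRestore(List<ROLLBACK_META> collected);       // step 4: the only commit applied to the metadata table
>
>   void restore(List<INSTANT> toRollback) {
>     List<INSTANT> planned = scheduleRestore(toRollback);
>     List<ROLLBACK_META> collected = new ArrayList<>();
>     for (INSTANT instant : planned) {
>       // On a re-attempt, instants already rolled back are reconstructed from
>       // the persisted plan instead of being rolled back a second time.
>       collected.add(alreadyRolledBack(instant)
>           ? rollbackMetadataFromPlan(instant)
>           : executeRollback(instant));
>     }
>     commitRestore(collected);
>   }
> }
> {code}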





[jira] [Updated] (HUDI-1295) Implement: Metadata based bloom index - write path

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-1295:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Implement: Metadata based bloom index - write path
> --
>
> Key: HUDI-1295
> URL: https://issues.apache.org/jira/browse/HUDI-1295
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> The idea here is to maintain our bloom filters outside of parquet for 
> speedier access from the bloom index.
>  
> - Design and implement bloom filter migration to the metadata table.
> Design (schema for the payload; a Java sketch follows below):
> key: partitionName_fileName
> payload schema:
> isDeleted (boolean): true/false
> bloom_type: short
> ser_bloom: byte[] representing the serialized bloom filter.
>  
>  
>  
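> A minimal Java rendering of the payload schema above (illustrative only; the
> real metadata payload record layout may differ):
> {code:java}
> public final class BloomFilterMetadataPayloadSketch {
>   // Record key: "<partitionName>_<fileName>"
>   private final String key;
>   private final boolean isDeleted; // true when the backing file was deleted
>   private final short bloomType;   // code identifying the bloom filter type
>   private final byte[] serBloom;   // serialized bloom filter bytes
>
>   BloomFilterMetadataPayloadSketch(String key, boolean isDeleted,
>                                    short bloomType, byte[] serBloom) {
>     this.key = key;
>     this.isDeleted = isDeleted;
>     this.bloomType = bloomType;
>     this.serBloom = serBloom;
>   }
> }
> {code}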





[jira] [Updated] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-1296:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3349) Replace usages of `HoodieRecordPayload` w/ `HoodieRecord`

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3349:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Replace usages of `HoodieRecordPayload` w/ `HoodieRecord`
> -
>
> Key: HUDI-3349
> URL: https://issues.apache.org/jira/browse/HUDI-3349
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> From RFC-46:
> To promote `HoodieRecord` into a standardized API for interacting with a 
> single record, we need to:
>  # Rebase usages of `HoodieRecordPayload` w/ `HoodieRecord`
>  # Implement new standardized record-level APIs (like `getPartitionKey`, 
> `getRecordKey`, etc.) in `HoodieRecord` (a tiny sketch follows below)
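>  
> A tiny sketch of that direction (method names as listed in this ticket; the
> final API shape is up to RFC-46):
> {code:java}
> // Illustrative only: record-level accessors replace peeking into the payload.
> public abstract class StandardizedRecordSketch<T> {
>   public abstract String getRecordKey();
>   public abstract String getPartitionKey(); // name per the ticket text
> }
> {code}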





[jira] [Updated] (HUDI-3166) Implement new HoodieIndex based on metadata indices

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3166:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Implement new HoodieIndex based on metadata indices 
> 
>
> Key: HUDI-3166
> URL: https://issues.apache.org/jira/browse/HUDI-3166
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, metadata
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: metadata
> Fix For: 0.11.0
>
>
> A new HoodieIndex implementation working off the indices in the metadata 
> table.





[jira] [Updated] (HUDI-3181) Address test failures after enabling metadata index for bloom filters and column stats

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3181:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Address test failures after enabling metadata index for bloom filters and 
> column stats
> --
>
> Key: HUDI-3181
> URL: https://issues.apache.org/jira/browse/HUDI-3181
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2589) RFC: Metadata based index for bloom filter and column stats

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2589:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> RFC: Metadata based index for bloom filter and column stats
> ---
>
> Key: HUDI-2589
> URL: https://issues.apache.org/jira/browse/HUDI-2589
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3343) Metadata Table includes Uncommitted Log Files during Bootstrap

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3343:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Metadata Table includes Uncommitted Log Files during Bootstrap
> --
>
> Key: HUDI-3343
> URL: https://issues.apache.org/jira/browse/HUDI-3343
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> While working on a fix for HUDI-3322, I discovered the following issue:
> If we're bootstrapping the MT during a pending Rollback operation (this could 
> happen when the previous writer had the MT *disabled* when writing the data), 
> then, since bootstrapping is done _after_ the Rollback is executed (with its 
> side-effects already reflected on the FS), bootstrapping would incorrectly 
> include intermediary files created by the Rollback (like log files created 
> with a Rollback Command Block appended).
>  
> Filtering of the files is performed here: 
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java#L752
>  





[jira] [Updated] (HUDI-3225) RFC for Async Metadata Index

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3225:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> RFC for Async Metadata Index
> 
>
> Key: HUDI-3225
> URL: https://issues.apache.org/jira/browse/HUDI-3225
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3246) Blog on Kafka Connect Sink for Hudi

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3246:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Blog on Kafka Connect Sink for Hudi
> ---
>
> Key: HUDI-3246
> URL: https://issues.apache.org/jira/browse/HUDI-3246
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, kafka-connect
>Reporter: Ethan Guo
>Assignee: Rajesh Mahindra
>Priority: Blocker
>  Labels: kafka-connect
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3275) Add tests for async metadata indexing

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3275:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Add tests for async metadata indexing
> -
>
> Key: HUDI-3275
> URL: https://issues.apache.org/jira/browse/HUDI-3275
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, metadata
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3221) Support querying a table as of a savepoint

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3221:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Support querying a table as of a savepoint
> --
>
> Key: HUDI-3221
> URL: https://issues.apache.org/jira/browse/HUDI-3221
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: hive, reader-core, spark, writer-core
>Reporter: Ethan Guo
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.11.0
>
>
> Right now point-in-time queries are limited to what's retained by the 
> cleaner. If we fix this and expose it via SQL, then it's a gap we close.
> The DataFrame read path supports this option, but the SQL read path does not:
> [https://hudi.apache.org/docs/quick-start-guide/#time-travel-query]
> SparkSQL Syntax
> {code:sql}
> SELECT * FROM A.B TIMESTAMP AS OF 1643119574;
> SELECT * FROM A.B TIMESTAMP AS OF '2019-01-29 00:37:58';
> SELECT * FROM A.B VERSION AS OF 'Snapshot123456789';{code}
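> 
> For context, a minimal sketch of the existing DataFrame read path (the 
> `as.of.instant` option name is taken from the quick-start docs linked above; 
> this is illustrative, not a spec for the SQL path):
> {code:java}
> Dataset<Row> df = spark.read().format("hudi")
>     .option("as.of.instant", "20220129003758") // or "2022-01-29 00:37:58"
>     .load(basePath);
> {code}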



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3088:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Make Spark 3 the default profile for build and test
> ---
>
> Key: HUDI-3088
> URL: https://issues.apache.org/jira/browse/HUDI-3088
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> By default, when people check out the code, the Spark 3 profile should be 
> active for the repo. Also, all tests should run against the latest supported 
> Spark version. Correspondingly, the default Scala version becomes 2.12 and the 
> default Parquet version 1.12.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2809) Introduce a checksum mechanism for validating hoodie.properties

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2809:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Introduce a checksum mechanism for validating hoodie.properties
> ---
>
> Key: HUDI-2809
> URL: https://issues.apache.org/jira/browse/HUDI-2809
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Vinoth Chandar
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> The idea here is to add a
> hoodie.checksum=
> entry as the last value of hoodie.properties and throw an error if it does
> not validate. This is to guard against partial writes on HDFS.
>  
> The main implementation issue is the use of Properties, which is a hashtable, 
> so the entry is not necessarily added as the last value.
>  
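> A minimal sketch of the idea (illustrative only; the sort forces a 
> deterministic order since Properties is an unordered hashtable):
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.util.stream.Collectors;
> import java.util.zip.CRC32;
> 
> String body = props.entrySet().stream()
>     .filter(e -> !"hoodie.checksum".equals(e.getKey()))
>     .map(e -> e.getKey() + "=" + e.getValue())
>     .sorted()
>     .collect(Collectors.joining("\n"));
> CRC32 crc = new CRC32();
> crc.update(body.getBytes(StandardCharsets.UTF_8));
> // Written as the final line; re-computed and compared on every read.
> String checksumLine = "hoodie.checksum=" + crc.getValue();
> {code}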



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2656) Generalize HoodieIndex for flexible record data type

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2656:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Generalize HoodieIndex for flexible record data type
> 
>
> Key: HUDI-2656
> URL: https://issues.apache.org/jira/browse/HUDI-2656
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2925) Cleaner may attempt to delete the same file twice when metadata table is enabled

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2925:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Cleaner may attempt to delete the same file twice when metadata table is 
> enabled
> 
>
> Key: HUDI-2925
> URL: https://issues.apache.org/jira/browse/HUDI-2925
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:high
> Fix For: 0.11.0
>
>
> This issue happens only when TimelineServer is disabled (reason in next 
> comment). Our pipelines execute a write (insert or upsert) along with an 
> asynchronous clean. Metadata table is enabled.
>  
> Assume the timelines are as follows:
> Dataset:   100.commit        101.commit   102.clean.inflight
> Metadata: 100.deltacommit  
> (this happened as the pipeline failed due to non-HUDI issues while executing 
> 101 and 102)
>  
> In the next run of the pipeline, some more data is available, so a commit will 
> take place (103.commit.requested). Along with it, an asynchronous clean 
> starts (104.clean.requested). The [BaseCleanActionExecutor detects the 
> previously unfinished 
> clean|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java#L231]
>  (102.clean.inflight) and attempts to complete it first. So the order of cleans 
> will be 102.clean followed by 104.clean.
>  
> 102.clean => Suppose this deletes files from 90.commit
> 104.clean  => This should delete files from 91.commit
>  
> The issue is that while executing 104.clean, the file system view is still the 
> one that was used during 102.clean (i.e., post-clean the file system view is 
> not synced). When the metadata table is enabled, HoodieMetadataFileSystemView 
> is used, which has the metadata reader inside it. This metadata reader opens 
> the metadata table at a particular time instant (here 101.commit, as that was 
> the last completed action). Even after 102.clean is completed, the 
> HoodieMetadataFileSystemView is still using the cached metadata reader. 
> Hence, the reader still returns files from 90.commit that have already been 
> deleted by 102.clean.  
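> 
> A hypothetical sketch of the fix direction (method names illustrative, not 
> the committed fix):
> {code:java}
> // After completing the previously pending 102.clean, refresh the view so
> // the embedded metadata reader re-opens past 101.commit before 104.clean.
> table.getMetaClient().reloadActiveTimeline();
> table.getHoodieView().sync(); // drops the stale cached metadata reader
> {code}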
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2965) Fix layout optimization to appropriately handle nested columns references

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2965:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Fix layout optimization to appropriately handle nested columns references
> -
>
> Key: HUDI-2965
> URL: https://issues.apache.org/jira/browse/HUDI-2965
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, Layout Optimization only works for top-level columns specified 
> as the columns to order by.
>  
> We need to make sure it also works correctly when a nested field 
> reference is specified in the configuration (like "a.b.c", 
> referencing the field `c` within the `b` sub-object of the top-level "a" column).
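> 
> A hedged sketch of resolving such a reference against a Spark schema (the 
> helper name is ours, not Hudi's):
> {code:java}
> import org.apache.spark.sql.types.DataType;
> import org.apache.spark.sql.types.StructType;
> 
> static DataType resolveNestedField(StructType schema, String ref) {
>   DataType current = schema;
>   for (String part : ref.split("\\.")) {
>     // apply() throws if the field is absent, surfacing bad references early
>     current = ((StructType) current).apply(part).dataType();
>   }
>   return current; // for "a.b.c": type of field c inside struct b inside a
> }
> {code}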



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3175) Support INDEX action for async metadata indexing

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3175:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Support INDEX action for async metadata indexing
> 
>
> Key: HUDI-3175
> URL: https://issues.apache.org/jira/browse/HUDI-3175
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, metadata
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: metadata, pull-request-available
> Fix For: 0.11.0
>
>
> Add a new WriteOperationType and handle conflicts with concurrent writer or 
> any other async table service. Implement the protocol in HUDI-2488



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-512) Support for Index functions on columns to generate logical or micro partitioning

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-512:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Support for Index functions on columns to generate logical or micro 
> partitioning
> 
>
> Key: HUDI-512
> URL: https://issues.apache.org/jira/browse/HUDI-512
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Alexander Filipchik
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: features
> Fix For: 0.11.0
>
>
> This one is more inspirational but, I believe, will be very useful. 
> Currently Hudi follows the Hive table format, which means that data is 
> logically and physically partitioned into a folder structure like:
> table_name
>   2019
>     01
>     02
>        bla.parquet
>  
> This has several issues:
>  1) Modern object stores (AWS S3, GCP) are more performant when each file 
> name starts with some kind of random value. By definition, the Hive layout 
> is not ideal here.
> 2) Hive Metastore stores partitions in a text field in a single table (2 
> tables with very similar information) and doesn't support proper filtering. 
> Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are supported (at least in Hive 2.X.X).
> 3) Having a single point of failure which relies on a non-distributed DB is 
> dangerous and creates bottlenecks. 
>  
> The idea is to get rid of logical partitioning altogether (and the Hive 
> metastore as well). If a dataset has a time column, the user should be able to 
> query it without understanding the physical layout of the table (without 
> having to specify those partitions explicitly or accidentally ending up with 
> a full table scan).
> It will require some kind of mapping of time to file locations (similar to 
> Iceberg). I'm also leaning towards the idea that storing table metadata with 
> the table is a good thing, as it can be read by the engine in one shot and 
> will be faster than taxing a standalone metastore. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2961) Async table services can race with metadata table updates

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2961:
--
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-24)

> Async table services can race with metadata table updates
> -
>
> Key: HUDI-2961
> URL: https://issues.apache.org/jira/browse/HUDI-2961
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Today metadata table updates are done inline/synchronously with the data table 
> updates. Metadata table updates can also sometimes trigger table services 
> like compaction, which are also done inline w.r.t. the ongoing commit. So 
> updates to the metadata table are always serial. However, there can be async 
> table services like clustering which run in parallel with single or 
> multiple writers and can update the metadata table in parallel with the 
> writer commits. 
> In the multi-writer case, since we have a lock provider configured anyway, 
> metadata table updates are guarded against races. But lock providers are not a 
> must today for single-writer + async table service deployments, leading to 
> races in metadata table updates. An async table service like clustering can 
> race with the metadata table compaction and can write to the wrong delta log 
> file instead of the next delta file expected after the compaction.
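> 
> An illustrative guard for the single-writer + async services case (config keys 
> assumed from Hudi's concurrency-control docs, not mandated by this ticket):
> {code:java}
> df.write().format("hudi")
>     .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
>     .option("hoodie.write.lock.provider",
>         "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
>     .mode("append")
>     .save(basePath);
> {code}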



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2370) Supports data encryption

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2370:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Supports data encryption
> 
>
> Key: HUDI-2370
> URL: https://issues.apache.org/jira/browse/HUDI-2370
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Data security is becoming more and more important; if Hudi can support 
> encryption, it would be very welcome.
> 1. Specify column encryption
>  2. Support footer encryption
>  3. Custom encryption client interface (provide a memory-based encryption 
> client by default)
> 4. Specify the encryption key
>  
> When querying, you need to pass the relevant key or obtain query permission 
> via the client's encryption interface. If that fails, the result cannot be 
> returned:
>  1. When querying non-encrypted fields without the key, the data 
> is returned normally
>  2. When querying encrypted fields without the key, the data is not 
> returned
>  3. When querying encrypted fields with the key, the data is 
> returned normally
>  4. When querying all fields without the key, no result is 
> returned; with the key, the data returns normally
>  
> Start with COW first
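> 
> A hedged write-path sketch only (property names follow parquet-mr 1.12 
> modular encryption; none of this is a committed Hudi API):
> {code:java}
> df.write().format("hudi")
>     .option("parquet.crypto.factory.class",
>         "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
>     .option("parquet.encryption.kms.client.class", kmsClientClass) // your KMS client impl
>     .option("parquet.encryption.column.keys", "key1:ssn,salary")   // column encryption
>     .option("parquet.encryption.footer.key", "footerKey")          // footer encryption
>     .mode("append")
>     .save(basePath);
> {code}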



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2695) Documentation

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-2695:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Documentation
> -
>
> Key: HUDI-2695
> URL: https://issues.apache.org/jira/browse/HUDI-2695
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Kyle Weller
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3075) Docs for Debezium source

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3075:
--
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-3, 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Docs for Debezium source
> 
>
> Key: HUDI-3075
> URL: https://issues.apache.org/jira/browse/HUDI-3075
> Project: Apache Hudi
>  Issue Type: Task
>  Components: deltastreamer, docs
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3297) Write down details of all cases to consider and test for deploying metadata table

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3297:
--
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31  (was: 
Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24)

> Write down details of all cases to consider and test for deploying metadata 
> table
> -
>
> Key: HUDI-3297
> URL: https://issues.apache.org/jira/browse/HUDI-3297
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3208) Come up with rollout plan for enabling metadata table by default in 0.11

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-3208:
--
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31  (was: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24)

> Come up with rollout plan for enabling metadata table by default in 0.11
> 
>
> Key: HUDI-3208
> URL: https://issues.apache.org/jira/browse/HUDI-3208
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata, writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Code-level (see the sketch below):
>  * We should throw errors if a lock provider is not configured
>  * At no point should we lead unwitting users to corrupt their tables
> Docs: get community feedback on any proposal. 
> Then we start testing this more. 
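> 
> An illustrative fail-fast check (helper names hypothetical):
> {code:java}
> if (config.isMetadataTableEnabled()
>     && config.hasAsyncTableServices()
>     && !config.isLockProviderConfigured()) {
>   throw new HoodieException("Metadata table with async table services "
>       + "requires a lock provider; refusing to start rather than risk "
>       + "corrupting the table.");
> }
> {code}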



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #4728: [WIP][HUDI-2973] RFC-27: Data skipping index to improve query performance

2022-01-31 Thread GitBox


hudi-bot commented on pull request #4728:
URL: https://github.com/apache/hudi/pull/4728#issuecomment-1026468454


   
   ## CI report:
   
   * 7514b7f93324e7bb9affa1df3346085e3dcdd682 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5637)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4728: [WIP][HUDI-2973] RFC-27: Data skipping index to improve query performance

2022-01-31 Thread GitBox


hudi-bot removed a comment on pull request #4728:
URL: https://github.com/apache/hudi/pull/4728#issuecomment-1026420793


   
   ## CI report:
   
   * 7514b7f93324e7bb9affa1df3346085e3dcdd682 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5637)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-3316) HoodieColumnRangeMetadata doesn't include all statistics for the column

2022-01-31 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra closed HUDI-3316.
-
Resolution: Fixed

> HoodieColumnRangeMetadata doesn't include all statistics for the column
> ---
>
> Key: HUDI-3316
> URL: https://issues.apache.org/jira/browse/HUDI-3316
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> HoodieColumnRangeMetadata includes the following stats about a Parquet column:
>  * columnName
>  * minValue
>  * maxValue
>  * numNulls
>  
> Parquet's ColumnChunkMetaData does have more stats, and we need to include 
> them all in our index: 
>  * num values 
>  * total size
>  * total uncompressed size
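> 
> A sketch of pulling the extra stats from parquet-mr (the getters below exist 
> on org.apache.parquet.hadoop.metadata.ColumnChunkMetaData; wiring them into 
> the Hudi type is illustrative):
> {code:java}
> long numValues = columnChunkMetaData.getValueCount();
> long totalSize = columnChunkMetaData.getTotalSize();
> long totalUncompressedSize = columnChunkMetaData.getTotalUncompressedSize();
> // min/max/numNulls continue to come from columnChunkMetaData.getStatistics()
> {code}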



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-01-31 Thread GitBox


zhangyue19921010 commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r796251580



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private  transient JavaSparkContext jsc;
+  // config
+  private  Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option<AsyncMetadataTableValidateService> asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+this.jsc = jsc;
+this.cfg = cfg;
+
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+: readConfigFromFileSystem(jsc, cfg);
+
+this.metaClient = HoodieTableMetaClient.builder()
+.setConf(jsc.hadoopConfiguration()).setBasePath(cfg.basePath)
+.setLoadActiveTimelineOnLoad(true)
+.build();
+
+this.asyncMetadataTableValidateService = 
cfg.runningMode.equalsIgnoreCase(Mode.CONTINUOUS.name())
+? Option.of(new AsyncMetadataTableValidateService()) : Option.empty();
+  }
+
+  /**
+   * Reads config from the file system.
+   *
+   * @param jsc {@link JavaSparkContext} instance.
+   * @param cfg {@link Config} instance.
+   * @return the {@link TypedProperties} instance.
+   */
+  private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, 
Config cfg) {
+return UtilHelpers.readConfig(jsc.hadoopConfiguration(), new 
Path(cfg.propsFilePath), cfg.configs)
+.getProps(true);
+  }
+
+  public enum Mode {
+// Running MetadataTableValidator in continuous
+CONTINUOUS,
+// Running MetadataTableValidator once
+ONCE
+  }
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--base-path", "-sp"}, description = "Base path for 
the table", required = true)
+public String basePath = null;
+
+@Parameter(names = {"--mode", "-m"}, description = "Set job mode: "
++ "Set \"CONTINUOUS\" Running MetadataTableValidator in continuous"
++ "Set \"ONCE\" Running MetadataTableValidator once", required = false)
+public String runningMode = "once";
+
+@Parameter(names = {"--min-validate-interval-seconds"},
+description = "the min validate interval of each validate in 
continuous mode")
+public Integer minValidateIntervalSeconds = 10;
+
+@Parameter(names = {"--ignore-failed", "-ig"}, description 

[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-01-31 Thread GitBox


zhangyue19921010 commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r796251289



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private  transient JavaSparkContext jsc;
+  // config
+  private  Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option<AsyncMetadataTableValidateService> asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+this.jsc = jsc;
+this.cfg = cfg;
+
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+: readConfigFromFileSystem(jsc, cfg);
+
+this.metaClient = HoodieTableMetaClient.builder()
+.setConf(jsc.hadoopConfiguration()).setBasePath(cfg.basePath)
+.setLoadActiveTimelineOnLoad(true)
+.build();
+
+this.asyncMetadataTableValidateService = 
cfg.runningMode.equalsIgnoreCase(Mode.CONTINUOUS.name())
+? Option.of(new AsyncMetadataTableValidateService()) : Option.empty();
+  }
+
+  /**
+   * Reads config from the file system.
+   *
+   * @param jsc {@link JavaSparkContext} instance.
+   * @param cfg {@link Config} instance.
+   * @return the {@link TypedProperties} instance.
+   */
+  private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, 
Config cfg) {
+return UtilHelpers.readConfig(jsc.hadoopConfiguration(), new 
Path(cfg.propsFilePath), cfg.configs)
+.getProps(true);
+  }
+
+  public enum Mode {
+// Running MetadataTableValidator in continuous
+CONTINUOUS,
+// Running MetadataTableValidator once
+ONCE
+  }
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--base-path", "-sp"}, description = "Base path for 
the table", required = true)
+public String basePath = null;
+
+@Parameter(names = {"--mode", "-m"}, description = "Set job mode: "
++ "Set \"CONTINUOUS\" Running MetadataTableValidator in continuous"
++ "Set \"ONCE\" Running MetadataTableValidator once", required = false)
+public String runningMode = "once";
+
+@Parameter(names = {"--min-validate-interval-seconds"},
+description = "the min validate interval of each validate in 
continuous mode")
+public Integer minValidateIntervalSeconds = 10;
+
+@Parameter(names = {"--ignore-failed", "-ig"}, description 

[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-01-31 Thread GitBox


zhangyue19921010 commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r796251222



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private  transient JavaSparkContext jsc;
+  // config
+  private  Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option<AsyncMetadataTableValidateService> asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+this.jsc = jsc;
+this.cfg = cfg;
+
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+: readConfigFromFileSystem(jsc, cfg);
+
+this.metaClient = HoodieTableMetaClient.builder()
+.setConf(jsc.hadoopConfiguration()).setBasePath(cfg.basePath)
+.setLoadActiveTimelineOnLoad(true)
+.build();
+
+this.asyncMetadataTableValidateService = 
cfg.runningMode.equalsIgnoreCase(Mode.CONTINUOUS.name())
+? Option.of(new AsyncMetadataTableValidateService()) : Option.empty();
+  }
+
+  /**
+   * Reads config from the file system.
+   *
+   * @param jsc {@link JavaSparkContext} instance.
+   * @param cfg {@link Config} instance.
+   * @return the {@link TypedProperties} instance.
+   */
+  private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, 
Config cfg) {
+return UtilHelpers.readConfig(jsc.hadoopConfiguration(), new 
Path(cfg.propsFilePath), cfg.configs)
+.getProps(true);
+  }
+
+  public enum Mode {
+// Running MetadataTableValidator in continuous
+CONTINUOUS,
+// Running MetadataTableValidator once
+ONCE
+  }
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--base-path", "-sp"}, description = "Base path for 
the table", required = true)
+public String basePath = null;
+
+@Parameter(names = {"--mode", "-m"}, description = "Set job mode: "
++ "Set \"CONTINUOUS\" Running MetadataTableValidator in continuous"
++ "Set \"ONCE\" Running MetadataTableValidator once", required = false)
+public String runningMode = "once";

Review comment:
   Yeap, use `Mode mode = Mode.valueOf(cfg.runningMode.toUpperCase());`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To 

[GitHub] [hudi] yanghua commented on pull request #4669: [HUDI-3239][Stacked on 4667] Convert `BaseHoodieTableFileIndex` to Java

2022-01-31 Thread GitBox


yanghua commented on pull request #4669:
URL: https://github.com/apache/hudi/pull/4669#issuecomment-1026461140


   > @yanghua there's a long stack of PRs, and there were issues w/ Metadata 
Table that were failing tests in the PR at the very bottom of it.
   > 
   > These issues have been addressed here in #4716, and as soon as it lands 
i'll rebase onto it.
   
   Sounds good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-01-31 Thread GitBox


zhangyue19921010 commented on a change in pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#discussion_r796251049



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieMetadataTableValidator.java
##
@@ -0,0 +1,337 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.async.HoodieAsyncService;
+import org.apache.hudi.client.common.HoodieSparkEngineContext;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.view.FileSystemViewManager;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.stream.Collectors;
+
+public class HoodieMetadataTableValidator {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieMetadataTableValidator.class);
+
+  // Spark context
+  private  transient JavaSparkContext jsc;
+  // config
+  private  Config cfg;
+  // Properties with source, hoodie client, key generator etc.
+  private TypedProperties props;
+
+  private HoodieTableMetaClient metaClient;
+
+  protected transient Option<AsyncMetadataTableValidateService> asyncMetadataTableValidateService;
+
+  public HoodieMetadataTableValidator(HoodieTableMetaClient metaClient) {
+this.metaClient = metaClient;
+  }
+
+  public HoodieMetadataTableValidator(JavaSparkContext jsc, Config cfg) {
+this.jsc = jsc;
+this.cfg = cfg;
+
+this.props = cfg.propsFilePath == null
+? UtilHelpers.buildProperties(cfg.configs)
+: readConfigFromFileSystem(jsc, cfg);
+
+this.metaClient = HoodieTableMetaClient.builder()
+.setConf(jsc.hadoopConfiguration()).setBasePath(cfg.basePath)
+.setLoadActiveTimelineOnLoad(true)
+.build();
+
+this.asyncMetadataTableValidateService = 
cfg.runningMode.equalsIgnoreCase(Mode.CONTINUOUS.name())
+? Option.of(new AsyncMetadataTableValidateService()) : Option.empty();
+  }
+
+  /**
+   * Reads config from the file system.
+   *
+   * @param jsc {@link JavaSparkContext} instance.
+   * @param cfg {@link Config} instance.
+   * @return the {@link TypedProperties} instance.
+   */
+  private TypedProperties readConfigFromFileSystem(JavaSparkContext jsc, 
Config cfg) {
+return UtilHelpers.readConfig(jsc.hadoopConfiguration(), new 
Path(cfg.propsFilePath), cfg.configs)
+.getProps(true);
+  }
+
+  public enum Mode {
+// Running MetadataTableValidator in continuous
+CONTINUOUS,
+// Running MetadataTableValidator once
+ONCE
+  }
+
+  public static class Config implements Serializable {
+@Parameter(names = {"--base-path", "-sp"}, description = "Base path for 
the table", required = true)
+public String basePath = null;
+
+@Parameter(names = {"--mode", "-m"}, description = "Set job mode: "
++ "Set \"CONTINUOUS\" Running MetadataTableValidator in continuous"
++ "Set \"ONCE\" Running MetadataTableValidator once", required = false)
+public String runningMode = "once";
+
+@Parameter(names = {"--min-validate-interval-seconds"},
+description = "the min validate interval of each validate in 
continuous mode")
+public Integer minValidateIntervalSeconds = 10;

Review comment:
   Okay, change it to 10 minutes for 

[GitHub] [hudi] zhangyue19921010 commented on pull request #4721: [HUDI-3320] Hoodie metadata table validator

2022-01-31 Thread GitBox


zhangyue19921010 commented on pull request #4721:
URL: https://github.com/apache/hudi/pull/4721#issuecomment-1026460599


   > Did you get a chance to run this job and test it out. both once mode and 
continuous mode ?
   
   Sure, I tested this job on a local env in both once and continuous modes. Thanks a 
lot for your attention :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] manojpec edited a comment on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-01-31 Thread GitBox


manojpec edited a comment on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1026457070


   @alexeykudinkin 
   Can the test fixtures be concise? The json files are huge. Any ways to trim 
what we are testing here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] manojpec commented on pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-01-31 Thread GitBox


manojpec commented on pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#issuecomment-1026457070


   Can the test fixtures be concise? The json files are huge. Any ways to trim 
what we are testing here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #4712: [HUDI-2809] Introduce a checksum mechanism for validating hoodie.properties

2022-01-31 Thread GitBox


nsivabalan commented on pull request #4712:
URL: https://github.com/apache/hudi/pull/4712#issuecomment-1026456937


   btw, we might need to add the checksum property as part of the upgrade. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated (4b388c1 -> 7ce0f45)

2022-01-31 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 4b388c1  [HUDI-3292] Enabling lazy read by default for log blocks 
during compaction (#4661)
 add 7ce0f45  [HUDI-2711] Fallback to fulltable scan for 
IncrementalRelation if underlying files have been cleared or moved by cleaner 
(#3946)

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/hudi/DataSourceOptions.scala  |  4 +
 .../org/apache/hudi/IncrementalRelation.scala  | 86 +++--
 .../apache/hudi/functional/TestCOWDataSource.scala | 87 +-
 .../hudi/utilities/sources/HoodieIncrSource.java   |  5 +-
 .../functional/TestHoodieDeltaStreamer.java| 49 
 5 files changed, 205 insertions(+), 26 deletions(-)


[GitHub] [hudi] nsivabalan merged pull request #3946: [HUDI-2711] Fallback to fulltable scan for IncrementalRelation if underlying files have been cleared or moved by cleaner

2022-01-31 Thread GitBox


nsivabalan merged pull request #3946:
URL: https://github.com/apache/hudi/pull/3946


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #4662: [HUDI-3293] Fixing default value for clustering small file config

2022-01-31 Thread GitBox


nsivabalan commented on a change in pull request #4662:
URL: https://github.com/apache/hudi/pull/4662#discussion_r796248713



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
##
@@ -83,7 +83,7 @@
 
   public static final ConfigProperty PLAN_STRATEGY_SMALL_FILE_LIMIT = 
ConfigProperty
   .key(CLUSTERING_STRATEGY_PARAM_PREFIX + "small.file.limit")
-  .defaultValue(String.valueOf(600 * 1024 * 1024L))
+  .defaultValue(String.valueOf(100 * 1024 * 1024L))

Review comment:
   sg. will make it 300MB then. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-3357) Implement the BigQuerySyncTool

2022-01-31 Thread Vinoth Govindarajan (Jira)
Vinoth Govindarajan created HUDI-3357:
-

 Summary: Implement the BigQuerySyncTool
 Key: HUDI-3357
 URL: https://issues.apache.org/jira/browse/HUDI-3357
 Project: Apache Hudi
  Issue Type: New Feature
  Components: hive-sync
Reporter: Vinoth Govindarajan


Implement the BigQuerySyncTool, similar to HiveSyncTool, which syncs the Hudi 
table into a BigQuery table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

