[GitHub] [hudi] hudi-bot commented on pull request #3912: [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3912:
URL: https://github.com/apache/hudi/pull/3912#issuecomment-962399326


   
   ## CI report:
   
   * 690d8bc00690e815f5a165b002fcf2ab84fb4d79 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3140)
 
   * b3ce9d1412d99e4ebe8207ed852681c110554e11 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3179)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #3912: [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3912:
URL: https://github.com/apache/hudi/pull/3912#issuecomment-962399131


   
   ## CI report:
   
   * 690d8bc00690e815f5a165b002fcf2ab84fb4d79 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3140)
 
   * b3ce9d1412d99e4ebe8207ed852681c110554e11 UNKNOWN
   
   




[jira] [Assigned] (HUDI-1658) [UMBRELLA] Spark Sql Support For Hudi

2021-11-05 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-1658:


Assignee: Yann Byron  (was: pengzhiwei)

> [UMBRELLA] Spark Sql Support For Hudi
> -
>
> Key: HUDI-1658
> URL: https://issues.apache.org/jira/browse/HUDI-1658
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
>  Labels: hudi-umbrellas
>
> This is the main task for supporting Spark SQL for Hudi, including DDL, 
> DML and the Hoodie CLI commands.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1832) Support Hoodie CLI Command In Spark SQL

2021-11-05 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron reassigned HUDI-1832:


Assignee: Yann Byron

> Support Hoodie CLI Command In Spark SQL
> ---
>
> Key: HUDI-1832
> URL: https://issues.apache.org/jira/browse/HUDI-1832
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Major
>
> Move the Hoodie CLI commands to Spark SQL. The syntax is as follows:
> {code:java}
> CLI_COMMAND [ (param_key1 = value1, param_key2 = value2...) ]
> {code}
> e.g.
> {code:java}
> commits show
> commit showfiles (commit = '20210114221306', limit = 10)
> show rollbacks
> savepoint create (commit = '20210114221306')
> {code}
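
To make the proposed `CLI_COMMAND [ (param_key1 = value1, ...) ]` shape concrete, here is a minimal sketch of how such a statement could be split into a command name and its parameters. The class `CliCommandSketch` and its methods are hypothetical names for illustration only; this is not the parser Hudi uses, and it assumes parameter values contain no commas or nested parentheses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for "CLI_COMMAND [ (key = value, ...) ]"; not Hudi code. */
class CliCommandSketch {
  // Group 1: the command words; group 2: the optional parameter list (parentheses excluded).
  private static final Pattern STATEMENT =
      Pattern.compile("\\s*([^()]+?)\\s*(?:\\((.*)\\))?\\s*");

  /** Returns the command part, e.g. "commit showfiles". */
  static String commandName(String statement) {
    return match(statement).group(1);
  }

  /** Returns the parameters in declaration order; empty if none were given. */
  static Map<String, String> params(String statement) {
    Map<String, String> result = new LinkedHashMap<>();
    String body = match(statement).group(2);
    if (body == null || body.trim().isEmpty()) {
      return result;
    }
    // Assumption: values never contain commas, so a plain split is enough.
    for (String pair : body.split(",")) {
      String[] kv = pair.split("=", 2);
      if (kv.length != 2) {
        throw new IllegalArgumentException("Expected key = value but got: " + pair);
      }
      result.put(kv[0].trim(), stripQuotes(kv[1].trim()));
    }
    return result;
  }

  private static Matcher match(String statement) {
    Matcher m = STATEMENT.matcher(statement);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a CLI statement: " + statement);
    }
    return m;
  }

  // Removes one pair of surrounding single quotes, if present.
  private static String stripQuotes(String value) {
    if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
      return value.substring(1, value.length() - 1);
    }
    return value;
  }
}
```

With this sketch, `commit showfiles (commit = '20210114221306', limit = 10)` yields the command `commit showfiles` and the parameters `commit` and `limit`, while a bare `show rollbacks` yields an empty parameter map.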





[GitHub] [hudi] hudi-bot commented on pull request #3912: [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3912:
URL: https://github.com/apache/hudi/pull/3912#issuecomment-962399131


   
   ## CI report:
   
   * 690d8bc00690e815f5a165b002fcf2ab84fb4d79 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3140)
 
   * b3ce9d1412d99e4ebe8207ed852681c110554e11 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3912: [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3912:
URL: https://github.com/apache/hudi/pull/3912#issuecomment-961588822


   
   ## CI report:
   
   * 690d8bc00690e815f5a165b002fcf2ab84fb4d79 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3140)
 
   
   




[GitHub] [hudi] guanziyue commented on pull request #3912: [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter

2021-11-05 Thread GitBox


guanziyue commented on pull request #3912:
URL: https://github.com/apache/hudi/pull/3912#issuecomment-962399151


   Fix checkstyle problems






[jira] [Commented] (HUDI-1885) Support Delete/Update Non-Pk Table

2021-11-05 Thread Yann Byron (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439602#comment-17439602
 ] 

Yann Byron commented on HUDI-1885:
--

I'll require users to provide `recordKey` or `primaryKey` to make 
Delete/Update available, until write-core supports this without a pk.

> Support Delete/Update Non-Pk Table
> --
>
> Key: HUDI-1885
> URL: https://issues.apache.org/jira/browse/HUDI-1885
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Allow deleting from and updating a non-pk table.
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi;
> delete from h0 where id = 10;
> update h0 set price = 10 where id = 12;
> {code}





[GitHub] [hudi] hudi-bot removed a comment on pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#issuecomment-962393214


   
   ## CI report:
   
   * f584cff8c637e290c8e82464474d67f5508ea1de Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3177)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#issuecomment-962397869


   
   ## CI report:
   
   * f584cff8c637e290c8e82464474d67f5508ea1de Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3177)
 
   
   




[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

2021-11-05 Thread GitBox


prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744077361



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##
@@ -145,15 +152,39 @@ protected void commit(List records, String 
partitionName, String i
*
* The record is tagged with respective file slice's location based on its 
record key.
*/
-  private JavaRDD prepRecords(List records, String 
partitionName, int numFileGroups) {
+  private JavaRDD prepRecords(JavaRDD recordsRDD, 
String partitionName, int numFileGroups) {
 List fileSlices = 
HoodieTableMetadataUtil.loadPartitionFileGroupsWithLatestFileSlices(metadataMetaClient,
 partitionName);
 ValidationUtils.checkArgument(fileSlices.size() == numFileGroups, 
String.format("Invalid number of file groups: found=%d, required=%d", 
fileSlices.size(), numFileGroups));
 
-JavaSparkContext jsc = ((HoodieSparkEngineContext) 
engineContext).getJavaSparkContext();
-return jsc.parallelize(records, 1).map(r -> {
+return recordsRDD.map(r -> {
   FileSlice slice = 
fileSlices.get(HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex(r.getRecordKey(),
 numFileGroups));
   r.setCurrentLocation(new 
HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
   return r;
 });
   }
+
+  @Override
+  protected void commit(List partitionInfoList, String 
createInstantTime) {

Review comment:
   Looking into this.
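
For readers following the diff, the tagging step in `prepRecords` picks a file slice per record from the record key. A minimal sketch of that idea, assuming a simple deterministic hash reduced modulo the number of file groups: the class `FileGroupIndexSketch` and the hash below are illustrative assumptions, not the actual body of `HoodieTableMetadataUtil.mapRecordKeyToFileGroupIndex`.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of key-to-file-group mapping; not Hudi's implementation. */
class FileGroupIndexSketch {
  /**
   * Maps a record key to an index in [0, numFileGroups).
   * Assumption: a polynomial hash over the UTF-8 bytes of the key; Hudi may hash differently.
   */
  static int mapKeyToFileGroupIndex(String recordKey, int numFileGroups) {
    if (numFileGroups <= 0) {
      throw new IllegalArgumentException("numFileGroups must be positive");
    }
    int hash = 0;
    for (byte b : recordKey.getBytes(StandardCharsets.UTF_8)) {
      hash = 31 * hash + (b & 0xff);
    }
    // floorMod keeps the result non-negative even if the hash overflowed to a negative int.
    return Math.floorMod(hash, numFileGroups);
  }
}
```

The property that matters for the change under review is that the mapping depends only on the key and the group count, so it can run inside `recordsRDD.map(...)` on executors without collecting the records on the driver first.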








[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2021-11-05 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439600#comment-17439600
 ] 

Vinoth Chandar commented on HUDI-310:
-

This has changed hands quite a bit.  We still want to take this on.  Interested?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>
> The goal here is to do CDC from DynamoDB and then have it be ingested into S3 
> as a Hudi dataset 
> Few resources: 
>  # DynamoDB Streams 
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html]
>   provides change capture logs in Kinesis. 
>  # Walkthrough 
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html]
>  Code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter] 
>  # Spark Streaming has support for reading Kinesis streams 
> [https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html] one 
> of the many resources showing how to change the Spark Kinesis example code to 
> consume dynamodb stream   
> [https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79]
>  # In DeltaStreamer, we need to add some form of KinesisSource that returns a 
> RDD with new data everytime `fetchNewData` is called 
> [https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java]
>   . DeltaStreamer itself does not use Spark Streaming APIs
>  # Internally, we have Avro, Json, Row sources that extract data in these 
> formats. 
> Open questions : 
>  # Should this just be a KinesisSource inside Hudi, that needs to be 
> configured differently or do we need two sources: DynamoDBKinesisSource (that 
> does some DynamoDB Stream specific setup/assumptions) and a plain 
> KinesisSource. What's more valuable to do , if we have to pick one. 
>  # For Kafka integration, we just reused the KafkaRDD in Spark Streaming 
> easily and avoided writing a lot of code by hand. Could we pull the same 
> thing off for Kinesis? (probably needs digging through Spark code) 
>  # What's the format of the data for DynamoDB streams? 
>  
>  
> We should probably flesh these out before going ahead with implementation? 
>  





[GitHub] [hudi] hudi-bot removed a comment on pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#issuecomment-962393028


   
   ## CI report:
   
   * f584cff8c637e290c8e82464474d67f5508ea1de UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#issuecomment-962393214


   
   ## CI report:
   
   * f584cff8c637e290c8e82464474d67f5508ea1de Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3177)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#issuecomment-962393028


   
   ## CI report:
   
   * f584cff8c637e290c8e82464474d67f5508ea1de UNKNOWN
   
   




[GitHub] [hudi] nsivabalan commented on pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-05 Thread GitBox


nsivabalan commented on pull request #3932:
URL: https://github.com/apache/hudi/pull/3932#issuecomment-962392948


   https://user-images.githubusercontent.com/513218/140597886-1a87a495-b358-4301-ac78-eb9aafc15433.png
   https://user-images.githubusercontent.com/513218/140597887-29cc850d-1bb6-4800-9dfe-a1617d3a0996.png
   https://user-images.githubusercontent.com/513218/140597888-8d458a21-4753-4674-82e7-75c24d8ff3bd.png
   https://user-images.githubusercontent.com/513218/140597889-981426d4-b753-4c07-a2d3-3f4df6cfd9cb.png
   https://user-images.githubusercontent.com/513218/140597890-0ae927c8-efb0-4b4f-a902-efb1a11aa4cb.png
   https://user-images.githubusercontent.com/513218/140597891-8311709d-6438-4b58-bb4a-d1413e733e3e.png
   






[jira] [Updated] (HUDI-2704) Create and publish RFC for metadata based bloom index

2021-11-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2704:
-
Labels: pull-request-available  (was: )

> Create and publish RFC for metadata based bloom index
> -
>
> Key: HUDI-2704
> URL: https://issues.apache.org/jira/browse/HUDI-2704
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>






[jira] [Assigned] (HUDI-2704) Create and publish RFC for metadata based bloom index

2021-11-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-2704:
-

Assignee: sivabalan narayanan

> Create and publish RFC for metadata based bloom index
> -
>
> Key: HUDI-2704
> URL: https://issues.apache.org/jira/browse/HUDI-2704
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.10.0
>
>






[GitHub] [hudi] nsivabalan opened a new pull request #3932: [HUDI-2704] Adding RFC-37 for Metadata based bloom index

2021-11-05 Thread GitBox


nsivabalan opened a new pull request #3932:
URL: https://github.com/apache/hudi/pull/3932


   ## What is the purpose of the pull request
   
   Adding an RFC for metadata based bloom index.
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[jira] [Updated] (HUDI-2704) Create and publish RFC for metadata based bloom index

2021-11-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2704:
--
Fix Version/s: 0.10.0

> Create and publish RFC for metadata based bloom index
> -
>
> Key: HUDI-2704
> URL: https://issues.apache.org/jira/browse/HUDI-2704
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.10.0
>
>






[jira] [Created] (HUDI-2704) Create and publish RFC for metadata based bloom index

2021-11-05 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-2704:
-

 Summary: Create and publish RFC for metadata based bloom index
 Key: HUDI-2704
 URL: https://issues.apache.org/jira/browse/HUDI-2704
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Docs
Reporter: sivabalan narayanan








[jira] [Updated] (HUDI-2704) Create and publish RFC for metadata based bloom index

2021-11-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2704:
--
Parent: HUDI-2703
Issue Type: Sub-task  (was: Improvement)

> Create and publish RFC for metadata based bloom index
> -
>
> Key: HUDI-2704
> URL: https://issues.apache.org/jira/browse/HUDI-2704
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: sivabalan narayanan
>Priority: Major
>






[jira] [Updated] (HUDI-2475) Rolling Upgrade downgrade story for 0.10 & enabling metadata

2021-11-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2475:
-
Summary: Rolling Upgrade downgrade story for 0.10 & enabling metadata  
(was: Upgrade downgrade infra for enabling metadata)

> Rolling Upgrade downgrade story for 0.10 & enabling metadata
> 
>
> Key: HUDI-2475
> URL: https://issues.apache.org/jira/browse/HUDI-2475
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Upgrade downgrade infra for enabling metadata.
>  
> If user is having a writer process and clustering/compaction running async.
>  
>  - New synchronous metadata design, has a constraint that once metadata table 
> is bootstrapped, all commits will happen synchronously. In other words, there 
> is no catch up business wrt datatable.  
> So, it may not be feasible to do rolling upgrade (i.e. upgrade writer first 
> while async compaction is running) and then upgrade async compaction. 
> Bootstrap has to be done by stopping all processes and then we can restart 
> all other processes one by one (by using the upgraded hudi library) w/ 
> metadata enabled.  
> This is the only viable option I can think of. 
> 1. Stop all processes. Upgrade hudi to a version w/ synchronous metadata. 
> Bring up one writer process w/ metadata config enabled. This will bootstrap 
> the metadata table, and from there on, any new commits by the writer will do 
> synchronous updates to metadata.  
> Note: users can choose to upgrade via hudi-cli if need be, but it would be 
> easier to just start the writer. Expect some delay for the first commit since 
> bootstrap will be happening. 
> 2. Once the first commit in the previous writer process completes successfully, 
> we can restart all other processes. Upgrade the async table service (to a hudi 
> version w/ metadata enabled) and restart it. *Ensure the metadata table is 
> enabled across all processes.*  Missing it on even one process could result 
> in data loss.
>  
> By this, once metadata table is bootstrapped, any new commits from all 
> processes will be synced to metadata. 
>  
>  
>  





[jira] [Commented] (HUDI-2475) Upgrade downgrade infra for enabling metadata

2021-11-05 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439599#comment-17439599
 ] 

Vinoth Chandar commented on HUDI-2475:
--

I agree with your assessment above for multi-writer deployments and for 
async/separate deployments of cleaning/compaction/clustering. 

Writing this up per deployment type for clarity 
{quote}{color:#FF}Still WIP {color}
{quote}
h2. Deltastreamer continuous mode & Spark Streaming with in-writer process 
async table services, with no other writers

No action needed, rolling upgrades work just fine.

(I looked at the code to ensure the writer's write client gets created first, 
which will perform upgrade (delete existing metadata table, then rebuild it), 
before any async scheduling/execution happens)
h2. Multiple writers (or) Single writers with table services running 
out-of-writer process asynchronously

*1) Stop all writers and table services*
 - If metadata enabled,
 - if metadata enabled later. 

*2) Bring up writers serially, first writer will upgrade to 0.10*

*3) Deploy 0.10 to table services and redeploy*

*4) Do not enable metadata on query side until 1-3 are successfully completed*

As usual, it's recommended that readers are upgraded to 0.10 bundles prior to 
1-3
 - Need to check what errors are thrown if not across engines.

 

> Upgrade downgrade infra for enabling metadata
> -
>
> Key: HUDI-2475
> URL: https://issues.apache.org/jira/browse/HUDI-2475
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Upgrade downgrade infra for enabling metadata.
>  
> If user is having a writer process and clustering/compaction running async.
>  
>  - New synchronous metadata design, has a constraint that once metadata table 
> is bootstrapped, all commits will happen synchronously. In other words, there 
> is no catch up business wrt datatable.  
> So, it may not be feasible to do rolling upgrade (i.e. upgrade writer first 
> while async compaction is running) and then upgrade async compaction. 
> Bootstrap has to be done by stopping all processes and then we can restart 
> all other processes one by one (by using the upgraded hudi library) w/ 
> metadata enabled.  
> This is the only viable option I can think of. 
> 1. Stop all processes. Upgrade hudi to a version w/ synchronous metadata. 
> Bring up one writer process w/ metadata config enabled. This will bootstrap 
> the metadata table, and from there on, any new commits by the writer will do 
> synchronous updates to metadata.  
> Note: users can choose to upgrade via hudi-cli if need be, but it would be 
> easier to just start the writer. Expect some delay for the first commit since 
> bootstrap will be happening. 
> 2. Once the first commit in the previous writer process completes successfully, 
> we can restart all other processes. Upgrade the async table service (to a hudi 
> version w/ metadata enabled) and restart it. *Ensure the metadata table is 
> enabled across all processes.*  Missing it on even one process could result 
> in data loss.
>  
> By this, once metadata table is bootstrapped, any new commits from all 
> processes will be synced to metadata. 
>  
>  
>  





[jira] [Resolved] (HUDI-2702) Set up keygen class explicit for write config for flink table upgrade

2021-11-05 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-2702.
--
Resolution: Fixed

Fixed via master branch: 9a8963d05eb93849e84151a63eecf12afe1017ce

> Set up keygen class explicit for write config for flink table upgrade
> -
>
> Key: HUDI-2702
> URL: https://issues.apache.org/jira/browse/HUDI-2702
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>






[hudi] branch master updated (08c35a5 -> 9a8963d)

2021-11-05 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 08c35a5  [HUDI-2526] Make spark.sql.parquet.writeLegacyFormat 
configurable (#3917)
 add 9a8963d  [HUDI-2702] Set up keygen class explicit for write config for 
flink table upgrade (#3931)

No new revisions were added by this update.

Summary of changes:
 hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java | 1 +
 1 file changed, 1 insertion(+)


[GitHub] [hudi] danny0405 merged pull request #3931: [HUDI-2702] Set up keygen class explicit for write config for flink t…

2021-11-05 Thread GitBox


danny0405 merged pull request #3931:
URL: https://github.com/apache/hudi/pull/3931


   






[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

2021-11-05 Thread GitBox


prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072970



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
##
@@ -36,4 +49,9 @@
   public String getFileExtension() {
 return extension;
   }
+
+  public static boolean isBaseFile(Path path) {

Review comment:
   I feel it's closer to file format, but I am open to moving it if you feel 
otherwise. Also, the isLogFile check is in FSUtils since it uses a regex 
instead of the file extension.
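
To illustrate the distinction drawn here, a rough sketch of an extension-based base-file check follows. It takes a plain file name rather than a Hadoop `Path` so that it stays self-contained, and the extension list is an assumption based on the base file formats (`.parquet`, `.orc`, `.hfile`); the class name `BaseFileCheckSketch` is hypothetical and this is not the code from the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical extension-based check; not the implementation under review. */
class BaseFileCheckSketch {
  // Assumption: the set of base-file extensions; adjust to the formats actually supported.
  private static final List<String> BASE_FILE_EXTENSIONS =
      Arrays.asList(".parquet", ".orc", ".hfile");

  /** Returns true when the file name ends with one of the assumed base-file extensions. */
  static boolean isBaseFile(String fileName) {
    int lastDot = fileName.lastIndexOf('.');
    if (lastDot < 0) {
      return false; // no extension at all
    }
    String extension = fileName.substring(lastDot).toLowerCase(Locale.ROOT);
    return BASE_FILE_EXTENSIONS.contains(extension);
  }
}
```

A log file name carries a version and write token after `.log`, so its last dot-separated segment is not a fixed extension. That is why an extension lookup suffices for base files while the log-file check needs a pattern match.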








[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

2021-11-05 Thread GitBox


prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072865



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##
@@ -419,52 +394,53 @@ private boolean 
bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
* @param dataMetaClient
* @return Map of partition names to a list of FileStatus for all the files 
in the partition
*/
-  private Map> 
getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List listAllPartitions(HoodieTableMetaClient 
datasetMetaClient) {
 List pathsToList = new LinkedList<>();
 pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-Map> partitionToFileStatus = new HashMap<>();
+List foundPartitionsList = new LinkedList<>();

Review comment:
   Renamed








[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

2021-11-05 Thread GitBox


prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072738



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##
@@ -419,52 +394,53 @@ private boolean 
bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
* @param dataMetaClient
* @return Map of partition names to a list of FileStatus for all the files 
in the partition
*/
-  private Map> 
getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List listAllPartitions(HoodieTableMetaClient 
datasetMetaClient) {
 List pathsToList = new LinkedList<>();
 pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-Map> partitionToFileStatus = new HashMap<>();
+List foundPartitionsList = new LinkedList<>();
 final int fileListingParallelism = 
metadataWriteConfig.getFileListingParallelism();
 SerializableConfiguration conf = new 
SerializableConfiguration(dataMetaClient.getHadoopConf());
 final String dirFilterRegex = 
dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+final String datasetBasePath = dataMetaClient.getBasePath();
 
 while (!pathsToList.isEmpty()) {
-  int listingParallelism = Math.min(fileListingParallelism, 
pathsToList.size());
+  // In each round we will list a section of directories
+  int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
   // List all directories in parallel
-  List> dirToFileListing = 
engineContext.map(pathsToList, path -> {
+  List foundDirsList = 
engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {
 FileSystem fs = path.getFileSystem(conf.get());
-return Pair.of(path, fs.listStatus(path));
-  }, listingParallelism);
-  pathsToList.clear();
+String relativeDirPath = FSUtils.getRelativePartitionPath(new 
Path(datasetBasePath), path);
+return new DirectoryInfo(relativeDirPath, fs.listStatus(path));
+  }, numDirsToList);
+
+  pathsToList = new LinkedList<>(pathsToList.subList(numDirsToList, 
pathsToList.size()));
 
   // If the listing reveals a directory, add it to queue. If the listing 
reveals a hoodie partition, add it to
   // the results.
-  dirToFileListing.forEach(p -> {
-if (!dirFilterRegex.isEmpty() && 
p.getLeft().getName().matches(dirFilterRegex)) {
-  LOG.info("Ignoring directory " + p.getLeft() + " which matches the 
filter regex " + dirFilterRegex);
-  return;
+  for (DirectoryInfo dirInfo : foundDirsList) {
+if (!dirFilterRegex.isEmpty()) {
+  final String relativePath = dirInfo.getRelativePath();
+  if (!relativePath.isEmpty()) {
+Path partitionPath = new Path(datasetBasePath, relativePath);
+if (partitionPath.getName().matches(dirFilterRegex)) {
+  LOG.info("Ignoring directory " + partitionPath + " which matches 
the filter regex " + dirFilterRegex);
+  continue;
+}
+  }
 }
 
-List filesInDir = Arrays.stream(p.getRight()).parallel()
-.filter(fs -> 
!fs.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
-.collect(Collectors.toList());
-
-if (p.getRight().length > filesInDir.size()) {
-  String partitionName = FSUtils.getRelativePartitionPath(new 
Path(dataMetaClient.getBasePath()), p.getLeft());
-  // deal with Non-partition table, we should exclude .hoodie
-  partitionToFileStatus.put(partitionName, filesInDir.stream()
-  .filter(f -> 
!f.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)).collect(Collectors.toList()));
+if (dirInfo.isPartition()) {
+  // Add to result
+  foundPartitionsList.add(dirInfo);
 } else {
   // Add sub-dirs to the queue
-  pathsToList.addAll(Arrays.stream(p.getRight())
-  .filter(fs -> fs.isDirectory() && 
!fs.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME))
-  .map(fs -> fs.getPath())
-  .collect(Collectors.toList()));
+  pathsToList.addAll(dirInfo.getSubdirs());

Review comment:
   DirectoryInfo constructor parses the FileStatus[] array and constructs:
   1. A list of sub-directories
   2. Whether the directory is a partition (presence of partition meta file)
   
   So in the code above, dirInfo.getSubdirs() should only return the 
sub-directories.
   
   The DirectoryInfo constructor was not ignoring the .hoodie directory, and I 
will add handling for that. As it stands, .hoodie and its sub-dirs get listed (sub-optimal), 
but none of them will be identified as a partition because they lack partition meta 
files. I will update the code.
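
A minimal sketch of the constructor behaviour described above, assuming a stand-in `Entry` type in place of Hadoop's FileStatus so the example runs without dependencies. The class name, the `Entry` type, and the two constant values are assumptions for illustration; this is not the code in the pull request.

```java
import java.util.ArrayList;
import java.util.List;

final class DirectoryInfoSketch {
  // Stand-in for FileStatus: only a name and a directory flag.
  static final class Entry {
    final String name;
    final boolean directory;

    Entry(String name, boolean directory) {
      this.name = name;
      this.directory = directory;
    }
  }

  // Assumed values of the partition meta file name and the metadata folder name.
  static final String PARTITION_METAFILE = ".hoodie_partition_metadata";
  static final String METAFOLDER_NAME = ".hoodie";

  private final List<String> subdirs = new ArrayList<>();
  private boolean partition = false;

  // Parses one directory listing into (1) its sub-directories and (2) a partition flag.
  DirectoryInfoSketch(String relativePath, Entry[] listing) {
    for (Entry e : listing) {
      if (e.directory) {
        // Skip the metadata folder so it is never queued for listing.
        if (!e.name.equals(METAFOLDER_NAME)) {
          subdirs.add(relativePath.isEmpty() ? e.name : relativePath + "/" + e.name);
        }
      } else if (e.name.equals(PARTITION_METAFILE)) {
        // Presence of the partition meta file marks this directory as a partition.
        partition = true;
      }
    }
  }

  boolean isPartition() {
    return partition;
  }

  List<String> getSubdirs() {
    return subdirs;
  }

  public static void main(String[] args) {
    Entry[] root = {new Entry(".hoodie", true), new Entry("2021", true)};
    DirectoryInfoSketch info = new DirectoryInfoSketch("", root);
    System.out.println(info.isPartition() + " " + info.getSubdirs());
  }
}
```

With the metadata folder filtered in the constructor, a caller that does `pathsToList.addAll(dirInfo.getSubdirs())` would no longer queue .hoodie for listing.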





[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

2021-11-05 Thread GitBox


prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r744072765



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##
@@ -419,52 +394,53 @@ private boolean 
bootstrapFromFilesystem(HoodieEngineContext engineContext, Hoodi
* @param dataMetaClient
* @return Map of partition names to a list of FileStatus for all the files 
in the partition
*/
-  private Map> 
getPartitionsToFilesMapping(HoodieTableMetaClient dataMetaClient) {
+  private List listAllPartitions(HoodieTableMetaClient 
datasetMetaClient) {
 List pathsToList = new LinkedList<>();
 pathsToList.add(new Path(dataWriteConfig.getBasePath()));
 
-Map> partitionToFileStatus = new HashMap<>();
+List foundPartitionsList = new LinkedList<>();
 final int fileListingParallelism = 
metadataWriteConfig.getFileListingParallelism();
 SerializableConfiguration conf = new 
SerializableConfiguration(dataMetaClient.getHadoopConf());
 final String dirFilterRegex = 
dataWriteConfig.getMetadataConfig().getDirectoryFilterRegex();
+final String datasetBasePath = dataMetaClient.getBasePath();
 
 while (!pathsToList.isEmpty()) {
-  int listingParallelism = Math.min(fileListingParallelism, 
pathsToList.size());
+  // In each round we will list a section of directories
+  int numDirsToList = Math.min(fileListingParallelism, pathsToList.size());
   // List all directories in parallel
-  List> dirToFileListing = 
engineContext.map(pathsToList, path -> {
+  List foundDirsList = 
engineContext.map(pathsToList.subList(0,  numDirsToList), path -> {

Review comment:
   Renamed.








[GitHub] [hudi] hudi-bot removed a comment on pull request #3931: [HUDI-2702] Set up keygen class explicit for write config for flink t…

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3931:
URL: https://github.com/apache/hudi/pull/3931#issuecomment-962379179


   
   ## CI report:
   
   * 1d00470d8bc57d799236a88a27b4a5b524904007 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3176)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3931: [HUDI-2702] Set up keygen class explicit for write config for flink t…

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3931:
URL: https://github.com/apache/hudi/pull/3931#issuecomment-962387547


   
   ## CI report:
   
   * 1d00470d8bc57d799236a88a27b4a5b524904007 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3176)
 
   
   




[jira] [Created] (HUDI-2703) [RFC-37] Metadata based bloom index

2021-11-05 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-2703:
-

 Summary: [RFC-37] Metadata based bloom index
 Key: HUDI-2703
 URL: https://issues.apache.org/jira/browse/HUDI-2703
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: sivabalan narayanan
 Fix For: 0.10.0


Hudi has indices to assist in tagging incoming records. The most commonly used one 
is the Bloom index. This involves looking up (loading) bloom filters from data files, which 
can be time consuming and can trigger throttling in cloud stores like 
S3. So, this RFC proposes adding bloom filters as a special partition in the metadata 
table and implementing an index based on that. 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2703) [RFC-37] Metadata based bloom index

2021-11-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2703:
--
Labels: hudi-umbrellas  (was: )

> [RFC-37] Metadata based bloom index
> ---
>
> Key: HUDI-2703
> URL: https://issues.apache.org/jira/browse/HUDI-2703
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 0.10.0
>
>
> Hudi has indices to assist in tagging incoming records. The most commonly used one 
> is the Bloom index. This involves looking up (loading) bloom filters from data files, 
> which can be time consuming and can trigger throttling in cloud 
> stores like S3. So, this RFC proposes adding bloom filters as a special partition in the 
> metadata table and implementing an index based on that. 
>  
>  





[jira] [Assigned] (HUDI-2703) [RFC-37] Metadata based bloom index

2021-11-05 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-2703:
-

Assignee: sivabalan narayanan

> [RFC-37] Metadata based bloom index
> ---
>
> Key: HUDI-2703
> URL: https://issues.apache.org/jira/browse/HUDI-2703
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 0.10.0
>
>
> Hudi has indices to assist in tagging incoming records. The most commonly used one 
> is the Bloom index. This involves looking up (loading) bloom filters from data files, 
> which can be time consuming and can trigger throttling in cloud 
> stores like S3. So, this RFC proposes adding bloom filters as a special partition in the 
> metadata table and implementing an index based on that. 
>  
>  





[jira] [Updated] (HUDI-2471) Add support ignoring case in merge into

2021-11-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HUDI-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

董可伦 updated HUDI-2471:
--
Summary: Add support ignoring case in merge into  (was: Add support 
ignoring case when  column name matches in merge into)

> Add support ignoring case in merge into
> ---
>
> Key: HUDI-2471
> URL: https://issues.apache.org/jira/browse/HUDI-2471
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>






[GitHub] [hudi] hudi-bot commented on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962379908


   
   ## CI report:
   
   * bdb2fa68fc669152c8978774f72912c5ae52d54b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3175)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962369431


   
   ## CI report:
   
   * f66b2f92c0ae8fa8af84ee4f91090aacc66c7f39 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3173)
 
   * bdb2fa68fc669152c8978774f72912c5ae52d54b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3175)
 
   
   




[jira] [Commented] (HUDI-1492) Enhance DeltaWriteStat with block level metadata correctly for storage schemes that support appends

2021-11-05 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439581#comment-17439581
 ] 

Vinoth Chandar commented on HUDI-1492:
--

I am assigning this to [~manojg] to consider as part of the column stats work. 

> Enhance DeltaWriteStat with block level metadata correctly for storage 
> schemes that support appends
> ---
>
> Key: HUDI-1492
> URL: https://issues.apache.org/jira/browse/HUDI-1492
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Current implementation simply uses the
> {code:java}
> String pathWithPartition = hoodieWriteStat.getPath(); {code}
> to write the metadata table. This is problematic if the delta write was 
> merely an append, and it can technically add duplicate files into the metadata 
> table 
> (not sure if this is a problem per se, but filing a Jira to track and either 
> close/fix). 
>  





[jira] [Assigned] (HUDI-1492) Enhance DeltaWriteStat with block level metadata correctly for storage schemes that support appends

2021-11-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-1492:


Assignee: Manoj Govindassamy  (was: Vinoth Chandar)

> Enhance DeltaWriteStat with block level metadata correctly for storage 
> schemes that support appends
> ---
>
> Key: HUDI-1492
> URL: https://issues.apache.org/jira/browse/HUDI-1492
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Current implementation simply uses the
> {code:java}
> String pathWithPartition = hoodieWriteStat.getPath(); {code}
> to write the metadata table. This is problematic if the delta write was 
> merely an append, and it can technically add duplicate files into the metadata 
> table 
> (not sure if this is a problem per se, but filing a Jira to track and either 
> close/fix). 
>  





[jira] [Updated] (HUDI-1492) Enhance DeltaWriteStat with block level metadata correctly for storage schemes that support appends

2021-11-05 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1492:
-
Summary: Enhance DeltaWriteStat with block level metadata correctly for 
storage schemes that support appends  (was: Handle DeltaWriteStat correctly for 
storage schemes that support appends)

> Enhance DeltaWriteStat with block level metadata correctly for storage 
> schemes that support appends
> ---
>
> Key: HUDI-1492
> URL: https://issues.apache.org/jira/browse/HUDI-1492
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Current implementation simply uses the
> {code:java}
> String pathWithPartition = hoodieWriteStat.getPath(); {code}
> to write the metadata table. This is problematic if the delta write was 
> merely an append, and it can technically add duplicate files into the metadata 
> table 
> (not sure if this is a problem per se, but filing a Jira to track and either 
> close/fix). 
>  





[jira] [Commented] (HUDI-1492) Handle DeltaWriteStat correctly for storage schemes that support appends

2021-11-05 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439580#comment-17439580
 ] 

Vinoth Chandar commented on HUDI-1492:
--

I am curious about what we are going to do here for, say, column stats. We 
probably would need to have the new stats merged with old stats for the same 
log. Bloom filters are also additive, so we are good there. But not every index 
would be like that, so it is better to fix the delta commit metadata correctly.
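
To illustrate what "additive" means for bloom filters, here is a minimal sketch assuming a toy filter backed by `java.util.BitSet`. Two filters built with the same size and the same hash functions can be merged with a bitwise OR, and the result answers membership for keys added to either one. The class name, sizes, and hash mixing are assumptions for illustration and are unrelated to Hudi's bloom filter classes.

```java
import java.util.BitSet;

final class BloomUnionSketch {
  // Toy parameters, chosen only for the example.
  static final int NUM_BITS = 1024;
  static final int NUM_HASHES = 3;

  // Derives the i-th bit position for a key; a simple illustrative mix, not a production hash.
  static int index(String key, int seed) {
    int h = key.hashCode() * 31 + seed * 0x9E3779B9;
    return Math.floorMod(h, NUM_BITS);
  }

  static void add(BitSet filter, String key) {
    for (int i = 0; i < NUM_HASHES; i++) {
      filter.set(index(key, i));
    }
  }

  // May return false positives, never false negatives.
  static boolean mightContain(BitSet filter, String key) {
    for (int i = 0; i < NUM_HASHES; i++) {
      if (!filter.get(index(key, i))) {
        return false;
      }
    }
    return true;
  }

  // Union of two filters with identical parameters: bitwise OR of their bits.
  static BitSet union(BitSet older, BitSet newer) {
    BitSet merged = (BitSet) older.clone();
    merged.or(newer);
    return merged;
  }

  public static void main(String[] args) {
    BitSet older = new BitSet(NUM_BITS);
    BitSet newer = new BitSet(NUM_BITS);
    add(older, "key-from-first-append");
    add(newer, "key-from-second-append");
    BitSet merged = union(older, newer);
    System.out.println(mightContain(merged, "key-from-first-append")
        && mightContain(merged, "key-from-second-append"));
  }
}
```

Column statistics such as min and max can be merged in a similar spirit, but an index whose entries cannot be combined this way would need the delta commit metadata to describe each append precisely.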

 

That said, we already put in this code here, which will merge these file names. 

 
{code:java}
if (fileInfo.getIsDeleted()) {
  // file deletion
  combinedFileInfo.remove(filename);
}
 {code}
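
A minimal sketch of how such a merge keeps a repeated append from producing duplicate entries, assuming file infos are held in a map keyed by file name. The `FileInfo` type, the `combine` method, and the rule of keeping the larger size for a file seen twice are assumptions for illustration, not the actual metadata payload code.

```java
import java.util.HashMap;
import java.util.Map;

final class FileInfoMergeSketch {
  // Hypothetical stand-in for a per-file metadata record.
  static final class FileInfo {
    final long size;
    final boolean deleted;

    FileInfo(long size, boolean deleted) {
      this.size = size;
      this.deleted = deleted;
    }
  }

  // Combines older and newer records; keying by file name means an append to an
  // existing log file updates its entry instead of adding a second one.
  static Map<String, FileInfo> combine(Map<String, FileInfo> older, Map<String, FileInfo> newer) {
    Map<String, FileInfo> combined = new HashMap<>(older);
    newer.forEach((filename, info) -> {
      if (info.deleted) {
        // file deletion
        combined.remove(filename);
      } else {
        // Assumed rule: an appended file only grows, so keep the larger size.
        combined.merge(filename, info, (a, b) -> new FileInfo(Math.max(a.size, b.size), false));
      }
    });
    return combined;
  }

  public static void main(String[] args) {
    Map<String, FileInfo> older = new HashMap<>();
    older.put(".f1.log.1", new FileInfo(100L, false));
    Map<String, FileInfo> newer = new HashMap<>();
    newer.put(".f1.log.1", new FileInfo(180L, false));
    System.out.println(combine(older, newer).size());
  }
}
```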

> Handle DeltaWriteStat correctly for storage schemes that support appends
> 
>
> Key: HUDI-1492
> URL: https://issues.apache.org/jira/browse/HUDI-1492
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Current implementation simply uses the
> {code:java}
> String pathWithPartition = hoodieWriteStat.getPath(); {code}
> to write the metadata table. This is problematic if the delta write was 
> merely an append, and it can technically add duplicate files into the metadata 
> table 
> (not sure if this is a problem per se, but filing a Jira to track and either 
> close/fix). 
>  





[GitHub] [hudi] hudi-bot removed a comment on pull request #3931: [HUDI-2702] Set up keygen class explicit for write config for flink t…

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3931:
URL: https://github.com/apache/hudi/pull/3931#issuecomment-962378941


   
   ## CI report:
   
   * 1d00470d8bc57d799236a88a27b4a5b524904007 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3931: [HUDI-2702] Set up keygen class explicit for write config for flink t…

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3931:
URL: https://github.com/apache/hudi/pull/3931#issuecomment-962379179


   
   ## CI report:
   
   * 1d00470d8bc57d799236a88a27b4a5b524904007 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3176)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3931: [HUDI-2702] Set up keygen class explicit for write config for flink t…

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3931:
URL: https://github.com/apache/hudi/pull/3931#issuecomment-962378941


   
   ## CI report:
   
   * 1d00470d8bc57d799236a88a27b4a5b524904007 UNKNOWN
   
   




[jira] [Updated] (HUDI-2702) Set up keygen class explicit for write config for flink table upgrade

2021-11-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2702:
-
Labels: pull-request-available  (was: )

> Set up keygen class explicit for write config for flink table upgrade
> -
>
> Key: HUDI-2702
> URL: https://issues.apache.org/jira/browse/HUDI-2702
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>






[GitHub] [hudi] danny0405 opened a new pull request #3931: [HUDI-2702] Set up keygen class explicit for write config for flink t…

2021-11-05 Thread GitBox


danny0405 opened a new pull request #3931:
URL: https://github.com/apache/hudi/pull/3931


   …able upgrade
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[jira] [Created] (HUDI-2702) Set up keygen class explicit for write config for flink table upgrade

2021-11-05 Thread Danny Chen (Jira)
Danny Chen created HUDI-2702:


 Summary: Set up keygen class explicit for write config for flink 
table upgrade
 Key: HUDI-2702
 URL: https://issues.apache.org/jira/browse/HUDI-2702
 Project: Apache Hudi
  Issue Type: Bug
  Components: Flink Integration
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 0.10.0








[GitHub] [hudi] hudi-bot commented on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962369431


   
   ## CI report:
   
   * f66b2f92c0ae8fa8af84ee4f91090aacc66c7f39 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3173)
 
   * bdb2fa68fc669152c8978774f72912c5ae52d54b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3175)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962303344


   
   ## CI report:
   
   * f66b2f92c0ae8fa8af84ee4f91090aacc66c7f39 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3173)
 
   * bdb2fa68fc669152c8978774f72912c5ae52d54b UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-962364201


   
   ## CI report:
   
   * 9d3a77d8ffc223b9a01aa665a75f01d8c4ac8a6f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3174)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-962298432


   
   ## CI report:
   
   * d8029ad9eb62c5e5b0480ef3135ce068ab51dc6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3171)
 
   * 9d3a77d8ffc223b9a01aa665a75f01d8c4ac8a6f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3174)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962303344


   
   ## CI report:
   
   * f66b2f92c0ae8fa8af84ee4f91090aacc66c7f39 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3173)
 
   * bdb2fa68fc669152c8978774f72912c5ae52d54b UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962299391


   
   ## CI report:
   
   * 8f19898017c238eb1cc49c197670d5c233d81cfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3172)
 
   * f66b2f92c0ae8fa8af84ee4f91090aacc66c7f39 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3173)
 
   * bdb2fa68fc669152c8978774f72912c5ae52d54b UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962297655


   
   ## CI report:
   
   * 8f19898017c238eb1cc49c197670d5c233d81cfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3172)
 
   * f66b2f92c0ae8fa8af84ee4f91090aacc66c7f39 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3173)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962299391


   
   ## CI report:
   
   * 8f19898017c238eb1cc49c197670d5c233d81cfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3172)
 
   * f66b2f92c0ae8fa8af84ee4f91090aacc66c7f39 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3173)
 
   * bdb2fa68fc669152c8978774f72912c5ae52d54b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-962298432


   
   ## CI report:
   
   * d8029ad9eb62c5e5b0480ef3135ce068ab51dc6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3171)
 
   * 9d3a77d8ffc223b9a01aa665a75f01d8c4ac8a6f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3174)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-962297946


   
   ## CI report:
   
   * d8029ad9eb62c5e5b0480ef3135ce068ab51dc6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3171)
 
   * 9d3a77d8ffc223b9a01aa665a75f01d8c4ac8a6f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962297352


   
   ## CI report:
   
   * 8f19898017c238eb1cc49c197670d5c233d81cfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3172)
 
   * f66b2f92c0ae8fa8af84ee4f91090aacc66c7f39 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962289158


   
   ## CI report:
   
   * 8f19898017c238eb1cc49c197670d5c233d81cfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3172)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] nsivabalan commented on a change in pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


nsivabalan commented on a change in pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#discussion_r744036617



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
##
@@ -95,10 +95,19 @@ public SparkRDDWriteClient(HoodieEngineContext context, HoodieWriteConfig writeC
   public SparkRDDWriteClient(HoodieEngineContext context, HoodieWriteConfig writeConfig,
                              Option timelineService) {
     super(context, writeConfig, timelineService);
+    bootstrapMetadataTable();
+  }
+
+  private void bootstrapMetadataTable() {
     if (config.isMetadataTableEnabled()) {
-      // If the metadata table does not exist, it should be bootstrapped here
-      // TODO: Check if we can remove this requirement - auto bootstrap on commit
-      SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(), config, context);
+      // Defer bootstrap if upgrade / downgrade is pending
+      HoodieTableMetaClient metaClient = createMetaClient(true);
+      UpgradeDowngrade upgradeDowngrade = new UpgradeDowngrade(
+          metaClient, config, context, SparkUpgradeDowngradeHelper.getInstance());
+      if (!upgradeDowngrade.needsUpgradeOrDowngrade(HoodieTableVersion.current())) {

Review comment:
Thanks for the quick turnaround. Yes, let's go with option 1.








[GitHub] [hudi] hudi-bot commented on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962289158


   
   ## CI report:
   
   * 8f19898017c238eb1cc49c197670d5c233d81cfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3172)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962276054


   
   ## CI report:
   
   * 41ea70200b4e5fd5646a692d8a6110a67045b1d8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3080)
 
   * 8f19898017c238eb1cc49c197670d5c233d81cfc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3172)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] leesf commented on a change in pull request #3927: [HUDI-2697]Minor changes about hbase index config.

2021-11-05 Thread GitBox


leesf commented on a change in pull request #3927:
URL: https://github.com/apache/hudi/pull/3927#discussion_r744027594



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -1300,7 +1300,7 @@ public int getHbaseIndexPutBatchSize() {
   }
 
   public Boolean getHbaseIndexPutBatchSizeAutoCompute() {

Review comment:
   I would also change the return type to `boolean`.
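
   One common reason for preferring a primitive return type is sketched below. This is an assumption about the motivation, not something stated in the review, and `IndexConfigSketch`, its methods, and the property key are hypothetical stand-ins rather than Hudi's `HoodieWriteConfig`. The sketch shows that a boxed `Boolean` getter can return null, which a caller would then have to guard against before using it in a condition.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a config class; not Hudi's HoodieWriteConfig.
class IndexConfigSketch {
  // Hypothetical property key, used only for this illustration.
  static final String KEY = "put.batch.size.autocompute";

  private final Map<String, String> props = new HashMap<>();

  void set(String key, String value) {
    props.put(key, value);
  }

  // Boxed getter: returns null when the key is absent, so a caller writing
  // `if (cfg.autoComputeBoxed())` would hit a NullPointerException on unboxing.
  Boolean autoComputeBoxed() {
    String raw = props.get(KEY);
    return raw == null ? null : Boolean.valueOf(raw);
  }

  // Primitive getter: always yields true or false (false when the key is absent).
  boolean autoCompute() {
    return Boolean.parseBoolean(props.get(KEY));
  }

  public static void main(String[] args) {
    IndexConfigSketch cfg = new IndexConfigSketch();
    System.out.println(cfg.autoCompute());              // key absent
    System.out.println(cfg.autoComputeBoxed() == null); // boxed getter has no value
    cfg.set(KEY, "true");
    System.out.println(cfg.autoCompute());              // key present
  }
}
```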








[GitHub] [hudi] hudi-bot commented on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962276054


   
   ## CI report:
   
   * 41ea70200b4e5fd5646a692d8a6110a67045b1d8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3080)
 
   * 8f19898017c238eb1cc49c197670d5c233d81cfc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3172)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962275302


   
   ## CI report:
   
   * 41ea70200b4e5fd5646a692d8a6110a67045b1d8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3080)
 
   * 8f19898017c238eb1cc49c197670d5c233d81cfc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-962275302


   
   ## CI report:
   
   * 41ea70200b4e5fd5646a692d8a6110a67045b1d8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3080)
 
   * 8f19898017c238eb1cc49c197670d5c233d81cfc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #3904: [WIP][HUDI-1295] Metadata Index - Bloom filter metadata to speed up index lookups

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3904:
URL: https://github.com/apache/hudi/pull/3904#issuecomment-961588811


   
   ## CI report:
   
   * 41ea70200b4e5fd5646a692d8a6110a67045b1d8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3080)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] prashantwason commented on a change in pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


prashantwason commented on a change in pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#discussion_r744019234



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
##
@@ -95,10 +95,19 @@ public SparkRDDWriteClient(HoodieEngineContext context, HoodieWriteConfig writeC
   public SparkRDDWriteClient(HoodieEngineContext context, HoodieWriteConfig writeConfig,
                              Option timelineService) {
     super(context, writeConfig, timelineService);
+    bootstrapMetadataTable();
+  }
+
+  private void bootstrapMetadataTable() {
     if (config.isMetadataTableEnabled()) {
-      // If the metadata table does not exist, it should be bootstrapped here
-      // TODO: Check if we can remove this requirement - auto bootstrap on commit
-      SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(), config, context);
+      // Defer bootstrap if upgrade / downgrade is pending
+      HoodieTableMetaClient metaClient = createMetaClient(true);
+      UpgradeDowngrade upgradeDowngrade = new UpgradeDowngrade(
+          metaClient, config, context, SparkUpgradeDowngradeHelper.getInstance());
+      if (!upgradeDowngrade.needsUpgradeOrDowngrade(HoodieTableVersion.current())) {

Review comment:
   I tried that step, but it did not work because:
   1. When getTableAndInitCtx is called, an action has already been started on the table.
   2. Metadata bootstrap does not happen because it detects an in-progress action.
   
   Bootstrapping in the constructor is surely not ideal.
   
   Possible ways:
   1. Make the bootstrap aware of the "current" operation so it can ignore it. Then we can bootstrap right after the upgrade/downgrade step (as you suggested).
   2. Bootstrap automatically before the write-to-metadata step.
   
   I prefer #1 too, as it is cleaner and the metadata table will be available before any actions on the dataset start.
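
   A minimal sketch of what option 1 could look like, assuming the bootstrap check is given the instant of the operation that the caller has just started. `BootstrapGuardSketch`, `canBootstrap`, and the instant strings are hypothetical and are not Hudi APIs; the sketch only illustrates ignoring the caller's own in-flight action when deciding whether bootstrap may proceed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of "option 1"; none of these names are Hudi APIs.
class BootstrapGuardSketch {

  // Returns true when bootstrap may proceed: no action is in flight
  // other than the one the caller itself has just started.
  static boolean canBootstrap(List<String> inflightInstants, String currentInstant) {
    return inflightInstants.stream()
        .allMatch(instant -> instant.equals(currentInstant));
  }

  public static void main(String[] args) {
    String current = "20211105101500";
    // Only the caller's own action is in flight: bootstrap is allowed.
    System.out.println(canBootstrap(Arrays.asList(current), current));
    // Another action is also in flight: bootstrap is deferred.
    System.out.println(canBootstrap(Arrays.asList(current, "20211105101700"), current));
  }
}
```

   Without the `currentInstant` argument, the first case would also be rejected, which matches the failure described in point 2 above.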








[GitHub] [hudi] nsivabalan commented on a change in pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


nsivabalan commented on a change in pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#discussion_r744011894



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
##
@@ -95,10 +95,19 @@ public SparkRDDWriteClient(HoodieEngineContext context, HoodieWriteConfig writeC
   public SparkRDDWriteClient(HoodieEngineContext context, HoodieWriteConfig writeConfig,
                              Option timelineService) {
     super(context, writeConfig, timelineService);
+    bootstrapMetadataTable();
+  }
+
+  private void bootstrapMetadataTable() {
     if (config.isMetadataTableEnabled()) {
-      // If the metadata table does not exist, it should be bootstrapped here
-      // TODO: Check if we can remove this requirement - auto bootstrap on commit
-      SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(), config, context);
+      // Defer bootstrap if upgrade / downgrade is pending
+      HoodieTableMetaClient metaClient = createMetaClient(true);
+      UpgradeDowngrade upgradeDowngrade = new UpgradeDowngrade(
+          metaClient, config, context, SparkUpgradeDowngradeHelper.getInstance());
+      if (!upgradeDowngrade.needsUpgradeOrDowngrade(HoodieTableVersion.current())) {

Review comment:
   Gotcha. I'm wondering if we can move the instantiation of the metadata table
   ```
   SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(), config, context);
   ```
   to just after the upgrade/downgrade step within getTableAndInitCtx (instead of in the constructor of SparkRDDWriteClient), so that we do not miss bootstrapping the table even if an upgrade is required.
   
   Here is what I am thinking: I would assume users typically stop all processes and then upgrade to the next Hudi version. Once the first commit goes through, users might want to get started with multiple writers. So I would prefer to do the bootstrap on the first write operation, where the upgrade is triggered.
   Open to hearing your thoughts.
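
   The ordering proposed here can be sketched roughly as follows. This is an illustration under stated assumptions, not the Hudi implementation: `FirstWriteSketch`, `initTableForWrite`, and the step names are hypothetical, and the real `getTableAndInitCtx` does considerably more.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the proposed ordering; these are not Hudi classes.
class FirstWriteSketch {
  private final List<String> steps = new ArrayList<>();
  private boolean needsUpgrade;
  private boolean metadataBootstrapped;

  FirstWriteSketch(boolean needsUpgrade) {
    this.needsUpgrade = needsUpgrade;
  }

  // Stand-in for the table-initialization step of a write operation:
  // run any pending upgrade first, then bootstrap the metadata table once.
  void initTableForWrite() {
    if (needsUpgrade) {
      steps.add("upgrade");
      needsUpgrade = false;
    }
    if (!metadataBootstrapped) {
      steps.add("bootstrap-metadata");
      metadataBootstrapped = true;
    }
    steps.add("write");
  }

  List<String> steps() {
    return steps;
  }

  public static void main(String[] args) {
    FirstWriteSketch client = new FirstWriteSketch(true);
    client.initTableForWrite(); // first write: upgrade, then bootstrap
    client.initTableForWrite(); // later writes: neither step repeats
    System.out.println(client.steps());
  }
}
```

   With this ordering the bootstrap is not skipped when an upgrade is pending; it simply runs after the upgrade on the same first write.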
   








[GitHub] [hudi] hudi-bot removed a comment on pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#issuecomment-962231751


   
   ## CI report:
   
   * 2e3cb57760c2ba7769d9f9892fa73486c9c7ee74 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2748)
 
   * e916c158969b8b855cbde8bceac94e80b98d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3170)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#issuecomment-962254215


   
   ## CI report:
   
   * e916c158969b8b855cbde8bceac94e80b98d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3170)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-962244237


   
   ## CI report:
   
   * d8029ad9eb62c5e5b0480ef3135ce068ab51dc6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3171)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-962243269


   
   ## CI report:
   
   * bf770ecbfccbacddb86a74832213b959addbfda2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1662)
 
   * d8029ad9eb62c5e5b0480ef3135ce068ab51dc6d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-962243269


   
   ## CI report:
   
   * bf770ecbfccbacddb86a74832213b959addbfda2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1662)
 
   * d8029ad9eb62c5e5b0480ef3135ce068ab51dc6d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-961588152


   
   ## CI report:
   
   * bf770ecbfccbacddb86a74832213b959addbfda2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1662)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] vinothchandar commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-05 Thread GitBox


vinothchandar commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-962238516


   @minihippo sorry for not mentioning this here earlier: @vingov is also going to try and test this for Uber.
   






[GitHub] [hudi] vinothchandar commented on a change in pull request #3900: [HUDI-2595] Fixing metadata table updates such that only regular writes from data table can trigger table services in metadat

2021-11-05 Thread GitBox


vinothchandar commented on a change in pull request #3900:
URL: https://github.com/apache/hudi/pull/3900#discussion_r743985200



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/BaseActionExecutor.java
##
@@ -57,7 +57,7 @@ public BaseActionExecutor(HoodieEngineContext context, HoodieWriteConfig config,
    * @param metadata commit metadata of interest.
    */
   protected final void writeTableMetadata(HoodieCommitMetadata metadata) {
-    table.getMetadataWriter().ifPresent(w -> w.update(metadata, instantTime));
+    table.getMetadataWriter().ifPresent(w -> w.update(metadata, instantTime, false));

Review comment:
   Hmm, should we assume it's a table service at this level? The idea is that even if BaseActionExecutor is extended further, the subclasses don't end up scheduling this on the metadata table.

##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
##
@@ -305,7 +305,8 @@ protected void completeCompaction(HoodieCommitMetadata 
metadata, JavaRDD writeStats = 
writeStatuses.map(WriteStatus::getStat).collect();
-writeTableMetadata(table, metadata, new 
HoodieInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.COMPACTION_ACTION, 
compactionCommitTime));
+writeTableMetadata(table, metadata, new 
HoodieInstant(HoodieInstant.State.INFLIGHT, HoodieTimeline.COMPACTION_ACTION, 
compactionCommitTime),

Review comment:
   Instead of passing a flag everywhere, can't we just limit this based on the action type and table type alone? I.e.,
   
   canTriggerTableServices = true iff (table_type = `cow` && action_type in ("commit")) or (table_type = `mor` && action_type in ("deltacommit")).
   
   Won't that be simpler?
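
   A minimal sketch of that rule, assuming table and action types are available as plain strings. `TableServicePolicySketch` and the string constants are hypothetical; Hudi has its own types for these, and this only mirrors the condition written above.

```java
import java.util.Locale;

// Hypothetical helper mirroring the rule above; not part of Hudi.
class TableServicePolicySketch {

  // True only for regular writes: "commit" on a copy-on-write table,
  // or "deltacommit" on a merge-on-read table.
  static boolean canTriggerTableServices(String tableType, String actionType) {
    String table = tableType.toLowerCase(Locale.ROOT);
    String action = actionType.toLowerCase(Locale.ROOT);
    return (table.equals("cow") && action.equals("commit"))
        || (table.equals("mor") && action.equals("deltacommit"));
  }

  public static void main(String[] args) {
    System.out.println(canTriggerTableServices("cow", "commit"));      // regular COW write
    System.out.println(canTriggerTableServices("mor", "deltacommit")); // regular MOR write
    System.out.println(canTriggerTableServices("mor", "compaction"));  // table service itself
  }
}
```

   Deriving the decision from the two types keeps call sites such as `writeTableMetadata` free of an extra boolean argument.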
   
   
   
   








[GitHub] [hudi] hudi-bot commented on pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#issuecomment-962231751


   
   ## CI report:
   
   * 2e3cb57760c2ba7769d9f9892fa73486c9c7ee74 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2748)
 
   * e916c158969b8b855cbde8bceac94e80b98d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3170)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#issuecomment-962230403


   
   ## CI report:
   
   * 2e3cb57760c2ba7769d9f9892fa73486c9c7ee74 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2748)
 
   * e916c158969b8b855cbde8bceac94e80b98d UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#issuecomment-961588627


   
   ## CI report:
   
   * 2e3cb57760c2ba7769d9f9892fa73486c9c7ee74 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2748)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#issuecomment-962230403


   
   ## CI report:
   
   * 2e3cb57760c2ba7769d9f9892fa73486c9c7ee74 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2748)
 
   * e916c158969b8b855cbde8bceac94e80b98d UNKNOWN
   
   




[GitHub] [hudi] prashantwason commented on a change in pull request #3836: [HUDI-2591] Bootstrap metadata table only if upgrade / downgrade is not required.

2021-11-05 Thread GitBox


prashantwason commented on a change in pull request #3836:
URL: https://github.com/apache/hudi/pull/3836#discussion_r743981870



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
##
@@ -95,10 +95,19 @@ public SparkRDDWriteClient(HoodieEngineContext context, 
HoodieWriteConfig writeC
   public SparkRDDWriteClient(HoodieEngineContext context, HoodieWriteConfig 
writeConfig,
  Option timelineService) {
 super(context, writeConfig, timelineService);
+bootstrapMetadataTable();
+  }
+
+  private void bootstrapMetadataTable() {
 if (config.isMetadataTableEnabled()) {
-  // If the metadata table does not exist, it should be bootstrapped here
-  // TODO: Check if we can remove this requirement - auto bootstrap on 
commit
-  
SparkHoodieBackedTableMetadataWriter.create(context.getHadoopConf().get(), 
config, context);
+  // Defer bootstrap if upgrade / downgrade is pending
+  HoodieTableMetaClient metaClient = createMetaClient(true);
+  UpgradeDowngrade upgradeDowngrade = new UpgradeDowngrade(
+  metaClient, config, context, 
SparkUpgradeDowngradeHelper.getInstance());
+  if 
(!upgradeDowngrade.needsUpgradeOrDowngrade(HoodieTableVersion.current())) {

Review comment:
   Yes, the bootstrap will happen the next time the SparkRDDWriteClient is 
created (probably in the next clean). 
   
   Currently this is what happens (assuming an existing table at version 2 
and 0.10 code, which uses version 3):
   1. SparkRDDWriteClient constructor - finds no metadata table, so it 
bootstraps one (wasted bootstrap)
   2. SparkRDDWriteClient.insert() - runs the upgrade code in getTableAndXXX(), 
where the metadata table is deleted.
   
   Next run:
   1. SparkRDDWriteClient constructor - finds no metadata table, so it 
bootstraps one (second bootstrap)
   
   
   For file listing this wasted bootstrap is kinda OK, but if other indexes are 
enabled together with it (e.g. record-level index enabled with the metadata 
table), this is a lot of wasted time.
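
   The sequence above can be modelled with a toy simulation. This is only a 
sketch under stated assumptions: `BootstrapSimulation` and all of its members 
are hypothetical and are not Hudi APIs. The constructor check stands in for 
the `needsUpgradeOrDowngrade` guard in the diff, and `write()` stands in for 
the upgrade that deletes the metadata table:

```java
/** Toy model of the sequence described above; none of these names are Hudi APIs. */
final class BootstrapSimulation {
  private boolean metadataTableExists = false;
  private int tableVersion;
  private final int codeVersion;
  private final boolean deferWhenUpgradePending;
  private int bootstrapCount = 0;

  BootstrapSimulation(int tableVersion, int codeVersion, boolean deferWhenUpgradePending) {
    this.tableVersion = tableVersion;
    this.codeVersion = codeVersion;
    this.deferWhenUpgradePending = deferWhenUpgradePending;
  }

  /** Models the write client constructor, which may bootstrap the metadata table. */
  void createWriteClient() {
    boolean upgradePending = tableVersion != codeVersion;
    if (deferWhenUpgradePending && upgradePending) {
      return; // leave the bootstrap to a client created after the upgrade
    }
    if (!metadataTableExists) {
      metadataTableExists = true;
      bootstrapCount++;
    }
  }

  /** Models the first write, which runs the upgrade and deletes the metadata table. */
  void write() {
    if (tableVersion != codeVersion) {
      metadataTableExists = false;
      tableVersion = codeVersion;
    }
  }

  /** Replays "constructor, insert, next-run constructor" and returns the number of bootstraps. */
  static int run(boolean deferWhenUpgradePending) {
    BootstrapSimulation sim = new BootstrapSimulation(2, 3, deferWhenUpgradePending);
    sim.createWriteClient(); // first run: constructor
    sim.write();             // first run: insert triggers the upgrade
    sim.createWriteClient(); // next run: constructor
    return sim.bootstrapCount;
  }
}
```

   In this model, `run(false)` counts two bootstraps (the first one is thrown 
away by the upgrade), while `run(true)` counts one.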








[GitHub] [hudi] vinothchandar commented on a change in pull request #3880: [HUDI-2606] Enable metadata reader by default

2021-11-05 Thread GitBox


vinothchandar commented on a change in pull request #3880:
URL: https://github.com/apache/hudi/pull/3880#discussion_r743977052



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
##
@@ -45,7 +45,7 @@
   .sinceVersion("0.7.0")
   .withDocumentation("Enable the internal metadata table which serves 
table metadata like level file listings");
 
-  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;

Review comment:
   I am wondering if we should just turn this on for Spark for now. 
Presto/Trino are being reworked a bit. Might be good to lay off








[GitHub] [hudi] vinothchandar commented on a change in pull request #3880: [HUDI-2606] Enable metadata reader by default

2021-11-05 Thread GitBox


vinothchandar commented on a change in pull request #3880:
URL: https://github.com/apache/hudi/pull/3880#discussion_r743976266



##
File path: packaging/hudi-hive-sync-bundle/pom.xml
##
@@ -69,12 +69,18 @@
   org.apache.hudi:hudi-sync-common
   org.apache.hudi:hudi-hive-sync
 
+  org.apache.hbase:hbase-client
+  org.apache.hbase:hbase-common

Review comment:
   This is an include in the bundle, right? Why `test` scope? Sorry, not 
following.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
##
@@ -45,7 +45,7 @@
   .sinceVersion("0.7.0")
   .withDocumentation("Enable the internal metadata table which serves 
table metadata like level file listings");
 
-  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;

Review comment:
   Does this generally handle the scenario where the metadata table is not 
present, or is being blown away? 

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java
##
@@ -45,7 +45,7 @@
   .sinceVersion("0.7.0")
   .withDocumentation("Enable the internal metadata table which serves 
table metadata like level file listings");
 
-  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;

Review comment:
   Also, which engines work off this? Spark, Hive, Presto and also Flink? 








[GitHub] [hudi] vinothchandar commented on a change in pull request #3889: [HUDI-2443] Hudi KVComparator for all HFile writer usages

2021-11-05 Thread GitBox


vinothchandar commented on a change in pull request #3889:
URL: https://github.com/apache/hudi/pull/3889#discussion_r743971409



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/HoodieKVComparator.java
##
@@ -0,0 +1,29 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.util;

Review comment:
   `util` is not a great name IMO. Let's have this in `common.io.storage`. 

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
##
@@ -103,7 +104,7 @@ public HoodieLogBlockType getBlockType() {
 FSDataOutputStream ostream = new FSDataOutputStream(baos, null);
 
 HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig)
-.withOutputStream(ostream).withFileContext(context).create();
+.withOutputStream(ostream).withFileContext(context).withComparator(new 
HoodieKVComparator()).create();

Review comment:
   Let's make sure bootstrap index writing does not use this code path. 
That would again be problematic. 

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
##
@@ -577,10 +578,4 @@ public String getName() {
 }
   }
 
-  /**
-   * This class is explicitly used as Key Comparator to workaround hard coded
-   * legacy format class names inside HBase. Otherwise we will face issues 
with shading.
-   */
-  public static class HoodieKVComparator extends KeyValue.KVComparator {

Review comment:
   We cannot remove this, since the bootstrap index already uses it. Let's 
retain it here. 








[GitHub] [hudi] nsivabalan commented on pull request #3824: [HUDI-2559] Millisecond granularity for instant timestamps

2021-11-05 Thread GitBox


nsivabalan commented on pull request #3824:
URL: https://github.com/apache/hudi/pull/3824#issuecomment-962198691


   @vinothchandar: would writing a UT in TestHoodieActiveTimeline suffice, 
just to compare the old and new timestamp formats? Or are you looking for 
tests at the write client layer? 






[GitHub] [hudi] nsivabalan commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

2021-11-05 Thread GitBox


nsivabalan commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-962201291










[GitHub] [hudi] vinothchandar commented on pull request #3930: [HUDI-431] [WIP] Adding parquet data block with inline read support

2021-11-05 Thread GitBox


vinothchandar commented on pull request #3930:
URL: https://github.com/apache/hudi/pull/3930#issuecomment-962196824


   @nsivabalan we should push this back a bit, and maybe @codope can even 
review this and take over? This is not directly related to the metadata 
table, right? 






[GitHub] [hudi] hudi-bot commented on pull request #3457: HUDI-1827 : Add ORC support in Bootstrap Op

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3457:
URL: https://github.com/apache/hudi/pull/3457#issuecomment-961588152


   
   ## CI report:
   
   * bf770ecbfccbacddb86a74832213b959addbfda2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1662)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3468: [HUDI-2306] Support setting whether drop duplicate based on table typ…

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #3468:
URL: https://github.com/apache/hudi/pull/3468#issuecomment-898270779


   
   ## CI report:
   
   * abf61dd81ea40386a5f4f5f1d6bc5adc92df Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1708)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot commented on pull request #2426: [HUDI-304] Configure spotless and java style

2021-11-05 Thread GitBox


hudi-bot commented on pull request #2426:
URL: https://github.com/apache/hudi/pull/2426#issuecomment-961587546


   
   ## CI report:
   
   * b7aa291685a1a350fff98607ebcdf0d19ce64f3f UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3744: [HUDI-2108] Fix flakiness in TestHoodieBackedMetadata

2021-11-05 Thread GitBox


hudi-bot commented on pull request #3744:
URL: https://github.com/apache/hudi/pull/3744#issuecomment-961589758


   
   ## CI report:
   
   * 5a724c6c859d67980473db571c9a90b8babcf710 UNKNOWN
   * 7e3a2869f3728226e24d8e999a95a1a9d1335a87 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2507)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #2982: [HUDI-1441] Fixing HoodieAvroUtils.rewriteRecord for nested record schema evolution

2021-11-05 Thread GitBox


hudi-bot commented on pull request #2982:
URL: https://github.com/apache/hudi/pull/2982#issuecomment-961587749


   
   ## CI report:
   
   * 92ca2e97f51fea8ef906eabcc83eb77facb27c5d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2086)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #2426: [HUDI-304] Configure spotless and java style

2021-11-05 Thread GitBox


hudi-bot removed a comment on pull request #2426:
URL: https://github.com/apache/hudi/pull/2426#issuecomment-910862643


   
   ## CI report:
   
   * b7aa291685a1a350fff98607ebcdf0d19ce64f3f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] prashantwason commented on a change in pull request #3873: [HUDI-2634] Improved the metadata table bootstrap for very large tables.

2021-11-05 Thread GitBox


prashantwason commented on a change in pull request #3873:
URL: https://github.com/apache/hudi/pull/3873#discussion_r743232434



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##
@@ -645,4 +612,83 @@ protected void doClean(AbstractHoodieWriteClient 
writeClient, String instantTime
 // metadata table.
 writeClient.clean(instantTime + "002");
   }
+
+  /**
+   * Commit the {@code HoodieRecord}s to Metadata Table as a new delta-commit.
+   *
+   */
+  protected abstract void commit(List records, String 
partitionName, String instantTime);
+
+  /**
+   * Commit the partition to file listing information to Metadata Table as a 
new delta-commit.
+   *
+   */
+  protected abstract void commit(List dirInfoList, String 
createInstantTime);
+
+
+  /**
+   * A class which represents a directory and the files and directories inside 
it.
+   *
+   * A {@code PartitionFileInfo} object saves the name of the partition and 
various properties requires of each file
+   * required for bootstrapping the metadata table. Saving limited properties 
reduces the total memory footprint when
+   * a very large number of files are present in the dataset being 
bootstrapped.
+   */
+  public static class DirectoryInfo implements Serializable {
+// Relative path of the directory (relative to the base directory)
+private String relativePath;
+// List of filenames within this partition
+private List filenames;
+// Length of the various files
+private List filelengths;
+// List of directories within this partition
+private List subdirs = new ArrayList<>();
+// Is this a HUDI partition
+private boolean isPartition = false;
+
+public DirectoryInfo(String relativePath, FileStatus[] fileStatus) {
+  this.relativePath = relativePath;
+
+  // Pre-allocate with the maximum length possible
+  filenames = new ArrayList<>(fileStatus.length);
+  filelengths = new ArrayList<>(fileStatus.length);
+
+  for (FileStatus status : fileStatus) {
+if (status.isDirectory()) {
+  this.subdirs.add(status.getPath());
+} else if 
(status.getPath().getName().equals(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE))
 {
+  // Presence of partition meta file implies this is a HUDI partition
+  this.isPartition = true;
+} else if (FSUtils.isDataFile(status.getPath())) {
+  // Regular HUDI data file (base file or log file)
+  filenames.add(status.getPath().getName());
+  filelengths.add(status.getLen());
+}
+  }
+}
+
+public String getRelativePath() {
+  return relativePath;
+}
+
+public int getTotalFiles() {
+  return filenames.size();
+}
+
+public boolean isPartition() {
+  return isPartition;
+}
+
+public List getSubdirs() {
+  return subdirs;
+}
+
+// Returns a map of filenames mapped to their lengths
+public Map getFileMap() {

Review comment:
   Done. A good simplification indeed.
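
   The quoted diff stops at `getFileMap()`, so its body is not shown here. 
Purely as an illustration of the code comment "Returns a map of filenames 
mapped to their lengths", the sketch below shows one way two parallel lists 
could be combined into such a map. `FileMapSketch` and `toFileMap` are 
hypothetical names, and this is not the PR's code nor necessarily the 
simplification referred to above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not from the PR): zips parallel name/length lists into a map. */
final class FileMapSketch {

  static Map<String, Long> toFileMap(List<String> filenames, List<Long> filelengths) {
    if (filenames.size() != filelengths.size()) {
      throw new IllegalArgumentException("filenames and filelengths must have the same size");
    }
    Map<String, Long> fileMap = new HashMap<>(filenames.size());
    for (int i = 0; i < filenames.size(); i++) {
      // Entry i of each list describes the same file.
      fileMap.put(filenames.get(i), filelengths.get(i));
    }
    return fileMap;
  }
}
```

   Keeping a single map from the start, instead of two lists that must stay 
aligned, would remove the size check altogether.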








[GitHub] [hudi] nsivabalan commented on issue #3868: [SUPPORT] hive syncing with `--spark-datasource` (first title was: Querying hudi datasets from standalone metastore)

2021-11-05 Thread GitBox


nsivabalan commented on issue #3868:
URL: https://github.com/apache/hudi/issues/3868#issuecomment-961860610


   Hey, can you give us the full set of configs you used to create/write to 
the Hudi table, along with the sync configs used? That would help us triage 
the issue. 






[GitHub] [hudi] yuzhaojing commented on pull request #3903: [HUDI-2651] Sync all the missing sql options for HoodieFlinkStreamer

2021-11-05 Thread GitBox


yuzhaojing commented on pull request #3903:
URL: https://github.com/apache/hudi/pull/3903#issuecomment-961582529


   > +1, thanks for the contribution, there are also some configuration 
inference in `HoodieTableSink`, we can do that in following PR though.
   
   OK, I will do that in another PR.





