[GitHub] [hudi] xuzifu666 commented on a change in pull request #4245: [MINOR] remove unuse construction method

2021-12-20 Thread GitBox


xuzifu666 commented on a change in pull request #4245:
URL: https://github.com/apache/hudi/pull/4245#discussion_r772902282



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java
##
@@ -579,10 +579,6 @@ public RecordReader getRecordReader(InputSplit split, JobConf job, Reporter repo
 protected CombineFileSplit inputSplitShim;
 private Map pathToPartitionInfo;
 
-public CombineHiveInputSplit() throws IOException {

Review comment:
   No, this constructor would not be called by, for example, serialization code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998550027


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
 
   * edb0803691023e502011e270ee83b69bc87c1ffb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4630)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998548499


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
 
   * edb0803691023e502011e270ee83b69bc87c1ffb UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#issuecomment-998549910


   
   ## CI report:
   
   * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4628)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#issuecomment-998510982


   
   ## CI report:
   
   * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
 
   * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4628)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998548499


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
 
   * edb0803691023e502011e270ee83b69bc87c1ffb UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998529708


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
 
   
   




[GitHub] [hudi] pratyakshsharma commented on pull request #3929: [HUDI-1881] Make multi table delta streamer to use thread pool for table sync asynchronously.

2021-12-20 Thread GitBox


pratyakshsharma commented on pull request #3929:
URL: https://github.com/apache/hudi/pull/3929#issuecomment-998543878


   @jadireddi Were you able to test it out?






[jira] [Created] (HUDI-3085) Refactor fileId & writeHandler logic into partitioner for bulk_insert

2021-12-20 Thread Yuwei Xiao (Jira)
Yuwei Xiao created HUDI-3085:


    Summary: Refactor fileId & writeHandler logic into partitioner for bulk_insert
        Key: HUDI-3085
        URL: https://issues.apache.org/jira/browse/HUDI-3085
    Project: Apache Hudi
 Issue Type: Improvement
   Reporter: Yuwei Xiao


a better partitioner abstraction for bulk_insert



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HUDI-2998) Claim RFC number for RFC for Consistent Hashing Index

2021-12-20 Thread Yuwei Xiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuwei Xiao resolved HUDI-2998.
--

> Claim RFC number for RFC for Consistent Hashing Index
> -
>
>        Key: HUDI-2998
>        URL: https://issues.apache.org/jira/browse/HUDI-2998
>    Project: Apache Hudi
> Issue Type: Sub-task
>   Reporter: Yuwei Xiao
>   Priority: Major
>     Labels: pull-request-available
>






[GitHub] [hudi] xushiyan edited a comment on pull request #4270: [HUDI-2811] Support Spark 3.2

2021-12-20 Thread GitBox


xushiyan edited a comment on pull request #4270:
URL: https://github.com/apache/hudi/pull/4270#issuecomment-998529462


   @leesf My main concern is that cherry-picking some Spark SQL fixes won't work 
after this lands in master. That's why I suggested keeping this as a feature 
branch that keeps rebasing on master. Any new feature that depends on Spark 3.2 
support should not be blocked, as it can be merged into this feature branch. cc 
@YannByron @nsivabalan 
   
   Alternatively, we can finalize all Spark SQL related fixes for 0.10.1 in the 
next few days and land those in master. Then we can land this, so we know those 
fixes can be cherry-picked later. Does that sound better?






[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998529708


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998484771


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
 
   
   




[GitHub] [hudi] xushiyan commented on pull request #4270: [HUDI-2811] Support Spark 3.2

2021-12-20 Thread GitBox


xushiyan commented on pull request #4270:
URL: https://github.com/apache/hudi/pull/4270#issuecomment-998529462


   @leesf My main concern is that cherry-picking some Spark SQL fixes won't work 
after this lands in master. That's why I suggested keeping this as a feature 
branch that keeps rebasing on master. Any new feature that depends on Spark 3.2 
support should not be blocked, as it can be merged into this feature branch. cc 
@YannByron @nsivabalan 






[GitHub] [hudi] leesf commented on pull request #4270: [HUDI-2811] Support Spark 3.2

2021-12-20 Thread GitBox


leesf commented on pull request #4270:
URL: https://github.com/apache/hudi/pull/4270#issuecomment-998522239


   > @leesf @YannByron shall we keep this open until 0.10.1 is cut? given this 
   > won't be included in 0.10.1 and any bug fix PR on spark sql may have major 
   > conflicts with this change. I suggest we keep this as a feature branch and keep 
   > updating it and merge after 0.10.1. WDYT?
   
   @xushiyan Agreed that it should go into 0.11.0, but since 0.10.1 is not 
going to be released in the next few days, should we wait for the cut? Waiting 
would block the development of new features, so I think it should be merged 
into the master branch, and then we just cherry-pick bug fixes from master into 
0.10.1. CC @nsivabalan 






[GitHub] [hudi] hudi-bot commented on pull request #4342: [HUDI-735] Fixing error messages on record key not found

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4342:
URL: https://github.com/apache/hudi/pull/4342#issuecomment-998521581


   
   ## CI report:
   
   * 4ece718901801f81a97ebd4667e72a81e39b18e9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4626)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4342: [HUDI-735] Fixing error messages on record key not found

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4342:
URL: https://github.com/apache/hudi/pull/4342#issuecomment-998484733


   
   ## CI report:
   
   * 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
 
   * 4ece718901801f81a97ebd4667e72a81e39b18e9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4626)
 
   
   




[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2021-12-20 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r772869290



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log file, so we sort it from large

Review comment:
   > Have a clarification on the first fix. Could you add some UTs for this?
   
   OK, I'll try to add some UTs








[GitHub] [hudi] scxwhite commented on a change in pull request #4400: [HUDI-3069] compact improve

2021-12-20 Thread GitBox


scxwhite commented on a change in pull request #4400:
URL: https://github.com/apache/hudi/pull/4400#discussion_r772868883



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -264,8 +264,11 @@ HoodieCompactionPlan generateCompactionPlan(
 .getLatestFileSlices(partitionPath)
 .filter(slice -> !fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
 .map(s -> {
+  // We can think that the latest data is in the latest delta log file, so we sort it from large

Review comment:
   You're right, but in most cases the newest data is in the latest delta 
log, so we sort the log files from largest to smallest by instant time. That 
way the program avoids updating records that are already in the 
`ExternalSpillableMap`, which saves compaction time. What do you think?
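
   To make the trade-off concrete, here is a minimal sketch of the idea in 
plain Python. It is not the code from this PR: the `LogFile` tuples, the 
function names, and the dict standing in for the spillable map are all 
hypothetical, and it assumes that instant times are fixed-width strings that 
order correctly when compared, and that the record from the newest log file 
always wins (no ordering field is consulted).

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for a delta log file:
# (instant_time, [(record_key, value), ...]) in write order.
LogFile = Tuple[str, List[Tuple[str, str]]]


def merge_newest_first(log_files: Iterable[LogFile]) -> Tuple[Dict[str, str], int]:
    """Visit log files from the largest instant time down; insert each key once.

    Returns the merged map and the number of writes made to it.
    """
    merged: Dict[str, str] = {}
    writes = 0
    for _instant, records in sorted(log_files, key=lambda f: f[0], reverse=True):
        # Inside one file the last record for a key wins, so walk it backwards.
        for key, value in reversed(records):
            if key not in merged:  # a newer value is already stored, skip
                merged[key] = value
                writes += 1
    return merged, writes


def merge_oldest_first(log_files: Iterable[LogFile]) -> Tuple[Dict[str, str], int]:
    """Baseline: visit log files in instant order and overwrite on every record."""
    merged: Dict[str, str] = {}
    writes = 0
    for _instant, records in sorted(log_files, key=lambda f: f[0]):
        for key, value in records:
            merged[key] = value
            writes += 1
    return merged, writes


logs: List[LogFile] = [
    ("001", [("a", "v1"), ("b", "v1")]),
    ("002", [("a", "v2")]),
    ("003", [("a", "v3"), ("c", "v1")]),
]
newest_result, newest_writes = merge_newest_first(logs)
oldest_result, oldest_writes = merge_oldest_first(logs)
```

   Both orders produce the same merged map here, but the newest-first pass 
writes each key only once. The saving holds only under the "newest log wins" 
assumption stated above.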








[GitHub] [hudi] hudi-bot removed a comment on pull request #4408: [MINOR] unused method in HoodieColumnProjectionUtils removed

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4408:
URL: https://github.com/apache/hudi/pull/4408#issuecomment-998514524


   
   ## CI report:
   
   * 60e07eccbc1750c6e7fe4275274d48e7095ff407 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4408: [MINOR] unused method in HoodieColumnProjectionUtils removed

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4408:
URL: https://github.com/apache/hudi/pull/4408#issuecomment-998515702


   
   ## CI report:
   
   * 60e07eccbc1750c6e7fe4275274d48e7095ff407 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4629)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4408: [MINOR] unused method in HoodieColumnProjectionUtils removed

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4408:
URL: https://github.com/apache/hudi/pull/4408#issuecomment-998514524


   
   ## CI report:
   
   * 60e07eccbc1750c6e7fe4275274d48e7095ff407 UNKNOWN
   
   




[GitHub] [hudi] xuzifu666 opened a new pull request #4408: [MINOR] unused method in HoodieColumnProjectionUtils removed

2021-12-20 Thread GitBox


xuzifu666 opened a new pull request #4408:
URL: https://github.com/apache/hudi/pull/4408


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] mtami commented on issue #3429: [SUPPORT] Upserting timestamp with microseconds precision truncate the microseconds part

2021-12-20 Thread GitBox


mtami commented on issue #3429:
URL: https://github.com/apache/hudi/issues/3429#issuecomment-998512708


   Hi @nsivabalan 
   
   It's a timestamp string; I cast it to a timestamp:
   
   `input_df = input_df.withColumn('updated', f.to_timestamp(f.col('updated')))`
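
   As a rough illustration of the symptom in the issue title, the pure-Python 
sketch below shows how a microsecond timestamp loses its last three digits as 
soon as it passes through a millisecond representation. The sample value and 
the helper name `to_epoch_micros` are made up for the example, and it does not 
claim to show where the truncation actually happens in Hudi or Spark.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(ts: datetime) -> int:
    """Convert an aware datetime to integer microseconds since the epoch."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


# Made-up sample value with microsecond precision.
original = datetime.strptime(
    "2021-12-20 10:15:30.123456", "%Y-%m-%d %H:%M:%S.%f"
).replace(tzinfo=timezone.utc)

micros = to_epoch_micros(original)

# Storing the value as milliseconds drops the last three digits ...
millis = micros // 1000
# ... and converting back cannot recover them.
round_tripped_micros = millis * 1000

print(micros % 1_000_000)                # fractional part before the round trip
print(round_tripped_micros % 1_000_000)  # fractional part after the round trip
```

   Comparing the fractional part before and after a write/read cycle in this 
way can help narrow down which step drops the precision.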
   






[GitHub] [hudi] waywtdcc commented on issue #4305: [SUPPORT] Duplicate Flink write record

2021-12-20 Thread GitBox


waywtdcc commented on issue #4305:
URL: https://github.com/apache/hudi/issues/4305#issuecomment-998512387


   Streaming. It works once 'index.global.enabled' is set to 'true'.






[GitHub] [hudi] hudi-bot commented on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4336:
URL: https://github.com/apache/hudi/pull/4336#issuecomment-998511039


   
   ## CI report:
   
   * 8f454b734d8848aee4cb6883999a658a7f007fc2 UNKNOWN
   * 6d39c38f416f2e0f8249f8bcff2434c07fe929aa UNKNOWN
   * 6e8aba03f04cc266b5e14f7f991770f1e024 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4367)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4625)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#issuecomment-998510982


   
   ## CI report:
   
   * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
 
   * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4628)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4336:
URL: https://github.com/apache/hudi/pull/4336#issuecomment-998483702


   
   ## CI report:
   
   * 8f454b734d8848aee4cb6883999a658a7f007fc2 UNKNOWN
   * 6d39c38f416f2e0f8249f8bcff2434c07fe929aa UNKNOWN
   * 6e8aba03f04cc266b5e14f7f991770f1e024 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4367)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4625)
 
   
   




[GitHub] [hudi] harsh1231 commented on a change in pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


harsh1231 commented on a change in pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#discussion_r772862716



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
##
@@ -123,6 +121,25 @@ case class HoodieFileIndex(
 }
   }
 
+  /**
+   * This method traverses StructType recursively to build map of columnName -> StructField
+   * Note : If there is nesting of columns like ["a.b.c.d", "a.b.c.e"]  -> final map will have keys corresponding
+   * only to ["a.b.c.d", "a.b.c.e"] and not for subsets like ["a.b.c", "a.b"]
+   * @param structField
+   * @return map of ( columns names -> StructField )
+   */
+  private def generateNameFieldMap(structField : Either[StructField, StructType]) : Map[String, StructField] = {
+structField match {

Review comment:
   Done, thanks for pointing out the Scala code style.
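
   For readers following the thread, the behaviour described in the doc 
comment above (only leaf paths become keys) can be sketched as follows. This is 
a plain-Python illustration, not the Scala method from the PR: nested dicts 
stand in for `StructType`, type-name strings stand in for `StructField`, and 
the function name `leaf_field_map` is hypothetical.

```python
from typing import Any, Dict


def leaf_field_map(schema: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Recursively map dotted leaf column names to their leaf field.

    A nested dict plays the role of a struct; any other value is a leaf.
    Intermediate paths such as "a.b" never become keys.
    """
    result: Dict[str, Any] = {}
    for name, field in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(field, dict):
            # Struct: descend, carrying the dotted path built so far.
            result.update(leaf_field_map(field, path))
        else:
            # Leaf: record the full dotted name.
            result[path] = field
    return result


# Made-up schema mirroring the ["a.b.c.d", "a.b.c.e"] example in the comment.
schema = {"a": {"b": {"c": {"d": "string", "e": "int"}}}, "id": "long"}
name_to_field = leaf_field_map(schema)
```

   With this schema the resulting keys are `a.b.c.d`, `a.b.c.e` and `id`; 
`a.b.c` and `a.b` are absent, which matches the note in the doc comment.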








[GitHub] [hudi] hudi-bot removed a comment on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#issuecomment-998509790


   
   ## CI report:
   
   * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
 
   * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#issuecomment-998509790


   
   ## CI report:
   
   * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
 
   * 7d046f914a059b2623d7f2a7627c44b15ccc0ddb UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#issuecomment-998465363


   
   ## CI report:
   
   * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
 
   
   




[GitHub] [hudi] xuzifu666 commented on pull request #4245: [MINOR] remove unuse construction method

2021-12-20 Thread GitBox


xuzifu666 commented on pull request #4245:
URL: https://github.com/apache/hudi/pull/4245#issuecomment-998509370


   @yanghua please have a review, thanks!






[GitHub] [hudi] YannByron commented on pull request #4270: [HUDI-2811] Support Spark 3.2

2021-12-20 Thread GitBox


YannByron commented on pull request #4270:
URL: https://github.com/apache/hudi/pull/4270#issuecomment-998503222


   > WDYT
   
   I agree. This is a feature that should go into a major version.






[GitHub] [hudi] RocMarshal commented on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.

2021-12-20 Thread GitBox


RocMarshal commented on pull request #3813:
URL: https://github.com/apache/hudi/pull/3813#issuecomment-998502857


   @vinothchandar I made some changes based on your comments. PTAL. Thanks.






[jira] [Assigned] (HUDI-3083) Support component data types for flink bulk_insert

2021-12-20 Thread dalongliu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dalongliu reassigned HUDI-3083:
---

Assignee: dalongliu

> Support component data types for flink bulk_insert
> --
>
> Key: HUDI-3083
> URL: https://issues.apache.org/jira/browse/HUDI-3083
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Flink Integration
> Reporter: Danny Chen
> Assignee: dalongliu
> Priority: Major
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] prashantwason commented on a change in pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…

2021-12-20 Thread GitBox


prashantwason commented on a change in pull request #4336:
URL: https://github.com/apache/hudi/pull/4336#discussion_r772843537



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##
@@ -706,7 +706,20 @@ protected void compactIfNecessary(AbstractHoodieWriteClient writeClient, String
 }
   }
 
-  protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime) {
+  protected void cleanIfNecessary(AbstractHoodieWriteClient writeClient, String instantTime) {
+Option lastCompletedCompactionInstant = metadataMetaClient.reloadActiveTimeline()
+.getCommitTimeline().filterCompletedInstants().lastInstant();
+if (lastCompletedCompactionInstant.isPresent()
+&& metadataMetaClient.getActiveTimeline().filterCompletedInstants()
+.findInstantsAfter(lastCompletedCompactionInstant.get().getTimestamp()).countInstants() < 3) {
+  // do not clean the log files immediately after compaction to give some buffer time for metadata table reader,

Review comment:
   So this problem should also exist in the MOR table data path? Is there 
any solution there?

##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java
##
@@ -154,7 +154,7 @@ protected void commit(HoodieData hoodieDataRecords, String partiti
   metadataMetaClient.reloadActiveTimeline();

Review comment:
   reloadActiveTimeline() is called here, so it is not necessary in cleanIfNecessary().

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##
@@ -706,7 +706,20 @@ protected void compactIfNecessary(AbstractHoodieWriteClient writeClient, String
 }
   }
 
-  protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime) {
+  protected void cleanIfNecessary(AbstractHoodieWriteClient writeClient, String instantTime) {
+Option lastCompletedCompactionInstant = metadataMetaClient.reloadActiveTimeline()

Review comment:
   Is reloadActiveTimeline() necessary here?

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
##
@@ -706,7 +706,20 @@ protected void compactIfNecessary(AbstractHoodieWriteClient writeClient, String
 }
   }
 
-  protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime) {
+  protected void cleanIfNecessary(AbstractHoodieWriteClient writeClient, String instantTime) {
+Option lastCompletedCompactionInstant = metadataMetaClient.reloadActiveTimeline()

Review comment:
   Also, can you check whether there is already a metadata table function to
get the last compaction timestamp?
   
   I guess there are other code paths where this is required, so it would be
a good idea to create a utility function if one does not exist.
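
   A minimal sketch of what such a utility could look like, assuming instant timestamps are fixed-width digit strings so that lexicographic order matches chronological order. The class and method names here (`CleanGateSketch`, `lastInstant`, `countInstantsAfter`, `shouldClean`) are hypothetical and use plain Java collections rather than Hudi's timeline API; the threshold of 3 in the diff is passed in as a parameter.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Hudi: models timelines as lists of timestamp strings.
final class CleanGateSketch {
    private CleanGateSketch() {}

    // Latest completed compaction timestamp, if any compaction has completed.
    static Optional<String> lastInstant(List<String> completedCompactions) {
        return completedCompactions.stream().max(Comparator.naturalOrder());
    }

    // Number of completed instants strictly after the given timestamp.
    static long countInstantsAfter(List<String> completedInstants, String timestamp) {
        return completedInstants.stream()
                .filter(t -> t.compareTo(timestamp) > 0)
                .count();
    }

    // Skip cleaning while fewer than minInstantsAfter instants have completed
    // since the last compaction; clean otherwise (including when no compaction exists).
    static boolean shouldClean(List<String> completedCompactions,
                               List<String> completedInstants,
                               int minInstantsAfter) {
        Optional<String> last = lastInstant(completedCompactions);
        if (last.isPresent()
                && countInstantsAfter(completedInstants, last.get()) < minInstantsAfter) {
            return false;
        }
        return true;
    }
}
```

   Keeping the "last compaction" lookup in one place would let other code paths reuse it instead of repeating the timeline filtering.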








[jira] [Created] (HUDI-3084) Fix the link of flink guide page

2021-12-20 Thread Danny Chen (Jira)
Danny Chen created HUDI-3084:


Summary: Fix the link of flink guide page
Key: HUDI-3084
URL: https://issues.apache.org/jira/browse/HUDI-3084
Project: Apache Hudi
Issue Type: Bug
Components: Docs
Reporter: Danny Chen
Fix For: 0.11.0








[jira] [Created] (HUDI-3083) Support component data types for flink bulk_insert

2021-12-20 Thread Danny Chen (Jira)
Danny Chen created HUDI-3083:


Summary: Support component data types for flink bulk_insert
Key: HUDI-3083
URL: https://issues.apache.org/jira/browse/HUDI-3083
Project: Apache Hudi
Issue Type: Improvement
Components: Flink Integration
Reporter: Danny Chen
Fix For: 0.11.0








[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462993#comment-17462993
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

{*}Note{*}: I ran the recent query from 'master' as I needed a fix of running 
clustering in parallel from master.

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Blocker
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> metadata_timeline.txt, metadata_timeline_archived.txt, stderr_part1.txt, 
> stderr_part2.txt, timeline.txt
>
>
> After the 'metadata table' is enabled, file listing takes a long time.
> If metadata is enabled on the reader side (as shown below), it takes even more
> time per file listing task.
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
>   .read
>   .format("org.apache.hudi")
>   .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
>   .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
>   .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in the view.AbstractTableFileSystemView
> resetFileGroupsReplaced function or in metadata.HoodieBackedTableMetadata.
> There are also many log messages from AbstractHoodieLogRecordReader.
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613
>  at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 

[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462993#comment-17462993
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 5:29 AM:


{*}Note{*}: I ran the recent query using 'master' as I needed a fix of running 
clustering in parallel from master.


was (Author: h7kanna):
{*}Note{*}: I ran the recent query from 'master' as I needed a fix of running 
clustering in parallel from master.


[GitHub] [hudi] hudi-bot commented on pull request #4342: [HUDI-735] Fixing error messages on record key not found

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4342:
URL: https://github.com/apache/hudi/pull/4342#issuecomment-998484733


   
   ## CI report:
   
   * 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
 
   * 4ece718901801f81a97ebd4667e72a81e39b18e9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4626)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4342: [HUDI-735] Fixing error messages on record key not found

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4342:
URL: https://github.com/apache/hudi/pull/4342#issuecomment-998483731


   
   ## CI report:
   
   * 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
 
   * 4ece718901801f81a97ebd4667e72a81e39b18e9 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998484771


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4627)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998466293


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
 
   
   




[GitHub] [hudi] XuQianJin-Stars commented on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


XuQianJin-Stars commented on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998484586


   @hudi-bot run azure






[GitHub] [hudi] hudi-bot commented on pull request #4342: [HUDI-735] Fixing error messages on record key not found

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4342:
URL: https://github.com/apache/hudi/pull/4342#issuecomment-998483731


   
   ## CI report:
   
   * 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
 
   * 4ece718901801f81a97ebd4667e72a81e39b18e9 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4342: [HUDI-735] Fixing error messages on record key not found

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4342:
URL: https://github.com/apache/hudi/pull/4342#issuecomment-997972323


   
   ## CI report:
   
   * 0638721fb418fc2e8d2fff47657617ea1d203b6d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4597)
 
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4336:
URL: https://github.com/apache/hudi/pull/4336#issuecomment-998483702


   
   ## CI report:
   
   * 8f454b734d8848aee4cb6883999a658a7f007fc2 UNKNOWN
   * 6d39c38f416f2e0f8249f8bcff2434c07fe929aa UNKNOWN
   * 6e8aba03f04cc266b5e14f7f991770f1e024 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4367)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4625)
 
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4336:
URL: https://github.com/apache/hudi/pull/4336#issuecomment-995532732


   
   ## CI report:
   
   * 8f454b734d8848aee4cb6883999a658a7f007fc2 UNKNOWN
   * 6d39c38f416f2e0f8249f8bcff2434c07fe929aa UNKNOWN
   * 6e8aba03f04cc266b5e14f7f991770f1e024 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4367)
 
   
   




[GitHub] [hudi] danny0405 commented on pull request #4336: [HUDI-3032] Do not clean the log files right after compaction for met…

2021-12-20 Thread GitBox


danny0405 commented on pull request #4336:
URL: https://github.com/apache/hudi/pull/4336#issuecomment-998483112


   @hudi-bot run azure






[GitHub] [hudi] harsh1231 commented on a change in pull request #4342: [HUDI-735] Fixing error messages on record key not found

2021-12-20 Thread GitBox


harsh1231 commented on a change in pull request #4342:
URL: https://github.com/apache/hudi/pull/4342#discussion_r772836433



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -229,7 +229,11 @@ object HoodieSparkSqlWriter {
 }
 sparkContext.getConf.registerAvroSchemas(schema)
 log.info(s"Registered avro schema : ${schema.toString(true)}")
-
+val columnSet = df.columns.toSet
+keyGenerator.getRecordKeyFieldNames.foreach(fieldName => if(!columnSet.contains(fieldName)) {
+  throw new Exception(s"record key '$fieldName' does not exist in existing table schema " +

Review comment:
   Done
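
   For readers following the diff, the check it adds can be sketched on its own as below. This is only an illustration in plain Java, assuming the record key fields and the DataFrame columns are available as simple collections; `RecordKeyCheckSketch`, `missingRecordKeys`, `validate`, and the exception message are hypothetical and are not taken from the pull request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not Hudi code: verifies that every configured
// record key field is present among the incoming columns.
final class RecordKeyCheckSketch {
    private RecordKeyCheckSketch() {}

    // Returns the record key fields that are absent from the column set,
    // preserving the order in which the key fields were configured.
    static List<String> missingRecordKeys(List<String> recordKeyFields, Set<String> columns) {
        List<String> missing = new ArrayList<>();
        for (String field : recordKeyFields) {
            if (!columns.contains(field)) {
                missing.add(field);
            }
        }
        return missing;
    }

    // Fails fast with a message naming all missing fields at once.
    static void validate(List<String> recordKeyFields, Set<String> columns) {
        List<String> missing = missingRecordKeys(recordKeyFields, columns);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "record key field(s) not found in input columns: " + missing);
        }
    }
}
```

   Collecting every missing field before throwing is a design choice of this sketch; the diff above throws on the first missing field instead.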








[GitHub] [hudi] pratyakshsharma commented on pull request #2768: [HUDI-485]: corrected the check for incremental sql

2021-12-20 Thread GitBox


pratyakshsharma commented on pull request #2768:
URL: https://github.com/apache/hudi/pull/2768#issuecomment-998481547


   Ack. Let me close this in a day or two.






[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: (was: Screen Shot 2021-12-20 at 10.17.44 PM-1.png)


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: (was: stderr_part2-1.txt)

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Blocker
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt
>
>
> After the 'metadata table' is enabled, file listing takes a long time.
> If metadata is also enabled on the reader side (as shown below), each file 
> listing task takes even longer.
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
>   .read
>   .format("org.apache.hudi")
>   .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
>   .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
>   .load()
> sessions.createOrReplaceTempView("sessions")
> {code}
> Existing tables (COW) have inline clustering enabled and have many replace commits.
> The logs seem to suggest the delay is in the resetFileGroupsReplaced function 
> of view.AbstractTableFileSystemView, or in metadata.HoodieBackedTableMetadata.
> There are also many log messages from AbstractHoodieLogRecordReader:
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613 at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next reader for l
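
When reading a log excerpt like the one quoted in this issue, it can help to count how many lines each logger emits, to see which component dominates. The following is a minimal sketch for that purpose, not part of Hudi. The class name `LogTally` and the regular expression are assumptions based only on the line shape visible in the excerpt (`<date> <time> <level> <logger>: <message>`). Lines wrapped by the mail archive would need to be rejoined first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogTally {
    // Assumed line shape: "2021-12-18 23:37:46,086 INFO some.Logger: message".
    private static final Pattern LINE = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3} (\\w+) ([\\w.]+): (.*)$");

    /** Counts matching log lines per logger name, in first-seen order. */
    public static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                // group(2) is the logger name, e.g. "log.AbstractHoodieLogRecordReader"
                counts.merge(m.group(2), 1, Integer::sum);
            }
            // Lines that do not match (wrapped fragments, blanks) are skipped.
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read 136 instants, 9731 replaced file groups",
            "2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1",
            "2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy",
            "2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1");
        tally(sample).forEach((logger, n) -> System.out.println(logger + " " + n));
    }
}
```

Running `tally` over the full `stderr` attachments rather than this short sample would give a more representative picture. The counts only show which loggers are noisy, not where the time is actually spent.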

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: (was: timeline-1.txt)

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Blocker
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt
>
>

[GitHub] [hudi] hudi-bot commented on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.

2021-12-20 Thread GitBox


hudi-bot commented on pull request #3813:
URL: https://github.com/apache/hudi/pull/3813#issuecomment-998478837


   
   ## CI report:
   
   * f05b7dc67c6fb1d6b9bf75fac0c37b42925bfa23 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4621)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #3813: [HUDI-2563][hudi-client] Refactor CompactionTriggerStrategy.

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #3813:
URL: https://github.com/apache/hudi/pull/3813#issuecomment-998444978


   
   ## CI report:
   
   * fb97a5759a60ffa76ae776ed1c53f9c33f8eb81b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3628)
 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3622)
 
   * f05b7dc67c6fb1d6b9bf75fac0c37b42925bfa23 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4621)
 
   
   
   






[jira] [Closed] (HUDI-2970) Archival fails with Delete_partition commits

2021-12-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-2970.


> Archival fails with Delete_partition commits
> 
>
> Key: HUDI-2970
> URL: https://issues.apache.org/jira/browse/HUDI-2970
> Project: Apache Hudi
> Issue Type: Bug
> Components: Writer Core
> Reporter: sivabalan narayanan
> Assignee: Raymond Xu
> Priority: Blocker
> Labels: pull-request-available, sev:critical
> Fix For: 0.11.0, 0.10.1
>
>
> We need to fix archival for data tables that have delete partition 
> operations. Archival does not work well with the replace commit files created 
> for a "delete partition" operation. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[hudi] branch master updated: [HUDI-2970] Add test for archiving replace commit (#4345)

2021-12-20 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 32a44bb  [HUDI-2970] Add test for archiving replace commit (#4345)
32a44bb is described below

commit 32a44bbe062c997b5a41266290fbe34d6323bfa6
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Mon Dec 20 21:01:59 2021 -0800

[HUDI-2970] Add test for archiving replace commit (#4345)
---
 ...dieSparkCopyOnWriteTableArchiveWithReplace.java | 103 +
 .../TestHoodieSparkMergeOnReadTableClustering.java |  12 +--
 ...HoodieSparkMergeOnReadTableIncrementalRead.java |   6 +-
 ...dieSparkMergeOnReadTableInsertUpdateDelete.java |   4 +-
 .../SparkClientFunctionalTestHarness.java  |   4 +-
 .../common/testutils/HoodieTestDataGenerator.java  |   3 +-
 6 files changed, 118 insertions(+), 14 deletions(-)

diff --git a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkCopyOnWriteTableArchiveWithReplace.java b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkCopyOnWriteTableArchiveWithReplace.java
new file mode 100644
index 000..1c66023
--- /dev/null
+++ b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkCopyOnWriteTableArchiveWithReplace.java
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.functional;
+
+import org.apache.hudi.client.SparkRDDWriteClient;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieCompactionConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
+
+import org.junit.jupiter.api.Tag;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH;
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS;
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH;
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.DEFAULT_THIRD_PARTITION_PATH;
+import static org.apache.hudi.testutils.HoodieClientTestUtils.countRecordsOptionallySince;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+@Tag("functional")
+public class TestHoodieSparkCopyOnWriteTableArchiveWithReplace extends SparkClientFunctionalTestHarness {
+
+  @ParameterizedTest
+  @ValueSource(booleans = {false, true})
+  public void testDeletePartitionAndArchive(boolean metadataEnabled) throws IOException {
+    HoodieTableMetaClient metaClient = getHoodieMetaClient(HoodieTableType.COPY_ON_WRITE);
+    HoodieWriteConfig writeConfig = getConfigBuilder(true)
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().archiveCommitsWith(2, 3).retainCommits(1).build())
+        .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(metadataEnabled).build())
+        .build();
+    try (SparkRDDWriteClient client = getHoodieWriteClient(writeConfig);
+         HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(DEFAULT_PARTITION_PATHS)) {
+
+      // 1st write batch; 3 commits for 3 partitions
+      String instantTime1 = HoodieActiveTimeline.createNewInstantTime(1000);
+      client.startCommitWithTime(instantTime1);
+      client.insert(jsc().parallelize(dataGen.generateInsertsForPartition(instantTime1, 10, DEFAULT_FIRST_PARTITION_PA
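
The test above configures `archiveCommitsWith(2, 3)`. As a rough illustration of what such a setting might mean for the active timeline, here is a minimal standalone sketch. It assumes the two arguments are the minimum and maximum number of instants to keep, and that once the active count exceeds the maximum, the oldest instants are archived until only the minimum remain. That policy is an assumption made for illustration, not a statement of Hudi's implementation. The class `ArchivalSketch`, the method `instantsToArchive`, and the instant names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArchivalSketch {
    /**
     * Returns the instants that would leave the active timeline.
     * Assumed policy: nothing is archived until the active count exceeds
     * maxToKeep; then the oldest instants go until minToKeep remain.
     * The list is expected to be ordered oldest first.
     */
    public static List<String> instantsToArchive(List<String> active, int minToKeep, int maxToKeep) {
        if (active.size() <= maxToKeep) {
            return new ArrayList<>();
        }
        return new ArrayList<>(active.subList(0, active.size() - minToKeep));
    }

    public static void main(String[] args) {
        // Hypothetical instant times, oldest first; a real timeline would
        // mix commit and replacecommit instants.
        List<String> active = Arrays.asList("001", "002", "003", "004");
        System.out.println(instantsToArchive(active, 2, 3)); // prints [001, 002]
    }
}
```

Under this assumed policy, a replace commit created by a delete partition operation is just another instant in the list, which is the case the new test is meant to exercise.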

[GitHub] [hudi] nsivabalan merged pull request #4345: [HUDI-2970] Add test for archiving replace commit

2021-12-20 Thread GitBox


nsivabalan merged pull request #4345:
URL: https://github.com/apache/hudi/pull/4345


   






[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3066:
-
Status: In Progress  (was: Open)

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Blocker
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2-1.txt, 
> stderr_part2.txt, timeline-1.txt, timeline.txt
>
>

[jira] [Commented] (HUDI-2834) Validate against supported hive versions

2021-12-20 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462988#comment-17462988
 ] 

Raymond Xu commented on HUDI-2834:
--

[~codope] [~shivnarayan] This was deprioritized from the 0.10.0 release. How do 
you think we should handle Hive versions in 0.11.0?

> Validate against supported hive versions
> 
>
> Key: HUDI-2834
> URL: https://issues.apache.org/jira/browse/HUDI-2834
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Hive Integration
> Reporter: Raymond Xu
> Assignee: Raymond Xu
> Priority: Critical
> Labels: pull-request-available
>






[GitHub] [hudi] xushiyan opened a new pull request #3744: [HUDI-2108] Fix flakiness in TestHoodieBackedMetadata

2021-12-20 Thread GitBox


xushiyan opened a new pull request #3744:
URL: https://github.com/apache/hudi/pull/3744


   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   






[GitHub] [hudi] xushiyan closed pull request #3744: [HUDI-2108] Fix flakiness in TestHoodieBackedMetadata

2021-12-20 Thread GitBox


xushiyan closed pull request #3744:
URL: https://github.com/apache/hudi/pull/3744


   






[GitHub] [hudi] xushiyan closed pull request #4138: [HUDI-2781] Set spark3 in azure pipelines

2021-12-20 Thread GitBox


xushiyan closed pull request #4138:
URL: https://github.com/apache/hudi/pull/4138


   






[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3066:
-
Priority: Blocker  (was: Major)

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Blocker
> Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2-1.txt, 
> stderr_part2.txt, timeline-1.txt, timeline.txt
>
>

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Vinoth Chandar (Jira)


     [ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-3066:
---------------------------------
    Fix Version/s: 0.11.0

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
> ------------------------------------------------------------------------------------
>
>                 Key: HUDI-3066
>                 URL: https://issues.apache.org/jira/browse/HUDI-3066
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>         Environment: EMR 6.4.0
> Hudi version : 0.10.0
>            Reporter: Harsha Teja Kanna
>            Assignee: Manoj Govindassamy
>            Priority: Major
>              Labels: performance, pull-request-available
>             Fix For: 0.11.0
>
>         Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, stderr_part1.txt, stderr_part2-1.txt, 
> stderr_part2.txt, timeline-1.txt, timeline.txt
>
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is also enabled on the reader side (as shown below), each file listing task takes even longer.
> {code:scala}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> // Note: despite its name, this is the Spark session's runtime config, not a Hadoop Configuration.
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
>   .read
>   .format("org.apache.hudi")
>   .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
>   .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
>   .load()
> sessions.createOrReplaceTempView("sessions")
> {code}
> The existing tables (COW) have inline clustering enabled and have many replace commits.
> The logs suggest the delay is in the resetFileGroupsReplaced function of view.AbstractTableFileSystemView, or in metadata.HoodieBackedTableMetadata.
> There are also many log messages from AbstractHoodieLogRecordReader.
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log file HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next reader for logfile HoodieLogFile{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590', fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a data block from file s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613 at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next read
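
One way to quantify the volume of log-reader messages in an excerpt like the one quoted above is to tally entries per logger. The following is only a minimal sketch in Java, assuming the line layout seen in the excerpt (`date time LEVEL logger: message`, one entry per line). The class `LoggerTally` and its method are hypothetical helpers for illustration and are not part of Hudi.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LoggerTally {
    // Assumed layout: "2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: message"
    private static final Pattern LINE = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3} (\\w+) ([\\w.]+): (.*)$");

    /** Counts log entries per logger name; lines that do not match the layout are skipped. */
    public static Map<String, Integer> countByLogger(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                // group(2) is the logger name, e.g. "log.AbstractHoodieLogRecordReader"
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample entries copied from the excerpt, with mail line wrapping undone.
        List<String> sample = List.of(
            "2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1",
            "2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy",
            "2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1");
        countByLogger(sample).forEach((logger, n) -> System.out.println(logger + " " + n));
    }
}
```

Run against the attached stderr files (after joining wrapped lines), a tally like this would show how much of the output comes from `AbstractHoodieLogRecordReader` relative to other loggers; it says nothing about where the time is actually spent.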

[GitHub] [hudi] dongkelun commented on pull request #4016: [HUDI-2675] Fix the exception 'Not an Avro data file' when archive and clean

2021-12-20 Thread GitBox


dongkelun commented on pull request #4016:
URL: https://github.com/apache/hudi/pull/4016#issuecomment-998466970


   > sure thanks @dongkelun. Looks like there is a write conflict. Can you rebase with latest master?
   
   OK, I will submit it with the test case later.






[GitHub] [hudi] hudi-bot commented on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998466293


   
   ## CI report:
   
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4364: [HUDI-3060] drop table for spark sql

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4364:
URL: https://github.com/apache/hudi/pull/4364#issuecomment-998443381


   
   ## CI report:
   
   * b7eb121dc6ec52b4ca0e55e6db862a4c7e948004 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4616)
   * b2b949daa4dbe143ec9eaea029c5d295ce550f9d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4619)
   
   




[jira] [Updated] (HUDI-1185) KeyGenerator class/interfaces need to be decoupled from Spark

2021-12-20 Thread Alexey Kudinkin (Jira)


     [ https://issues.apache.org/jira/browse/HUDI-1185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-1185:
----------------------------------
    Status: Open  (was: In Progress)

> KeyGenerator class/interfaces need to be decoupled from Spark
> -------------------------------------------------------------
>
>                 Key: HUDI-1185
>                 URL: https://issues.apache.org/jira/browse/HUDI-1185
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>    Affects Versions: 0.9.0
>            Reporter: Vinoth Chandar
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> https://github.com/apache/hudi/pull/1834#discussion_r466386893 has the context



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 4:37 AM:
-------------------------------------------------------------------

Complete log files for Slow run (Metadata reader on)

[^stderr_part1.txt]

[^stderr_part2.txt]


was (Author: h7kanna):
Complete log files

[^stderr_part1.txt]

[^stderr_part2.txt]


[jira] [Updated] (HUDI-2235) [UMBRELLA] Add virtual key support to Hudi

2021-12-20 Thread Alexey Kudinkin (Jira)


     [ https://issues.apache.org/jira/browse/HUDI-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-2235:
----------------------------------
    Status: Open  (was: In Progress)

> [UMBRELLA] Add virtual key support to Hudi
> ------------------------------------------
>
>                 Key: HUDI-2235
>                 URL: https://issues.apache.org/jira/browse/HUDI-2235
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Writer Core
>            Reporter: sivabalan narayanan
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: hudi-umbrellas
>             Fix For: 0.11.0
>
>
> Add virtual key support to Hudi.
>  
> Meta fields should not be persisted; existing columns should be leveraged instead.
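
The idea in the description above (derive keys from existing columns rather than persisting meta fields) can be illustrated with a small sketch. This is a hypothetical illustration in Java, not Hudi's implementation: the class `VirtualKeySketch`, the `recordKey` helper, the sample column names, and the `field:value` join format are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VirtualKeySketch {
    /**
     * Builds a record key on the fly from columns already present in the row,
     * instead of reading a separately persisted meta column.
     */
    public static String recordKey(Map<String, Object> row, List<String> keyFields) {
        List<String> parts = new ArrayList<>();
        for (String field : keyFields) {
            Object value = row.get(field);
            if (value == null) {
                throw new IllegalArgumentException("Key field is missing or null: " + field);
            }
            parts.add(field + ":" + value);
        }
        return String.join(",", parts);
    }

    public static void main(String[] args) {
        // Hypothetical row; only the two key columns contribute to the key.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("session_id", "s-42");
        row.put("entrydate", "2021-12-18");
        row.put("duration_ms", 1500);
        System.out.println(recordKey(row, List.of("session_id", "entrydate")));
    }
}
```

The trade-off in a design like this is storage versus compute: nothing extra is written per record, but the key has to be recomputed whenever it is needed.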





[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 12/21/21, 4:36 AM:
-------------------------------------------------------------------

Complete log files

[^stderr_part1.txt]

[^stderr_part2.txt]


was (Author: h7kanna):
Complete log files

 

[^stderr_part1.txt]

[^stderr_part1.txt][^stderr_part2.txt]


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:
------------------------------------
    Attachment: stderr_part2-1.txt


[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:
------------------------------------
    Attachment: stderr_part1.txt
                stderr_part2.txt


[jira] [Updated] (HUDI-3035) Unify Parquet writers

2021-12-20 Thread Alexey Kudinkin (Jira)


     [ https://issues.apache.org/jira/browse/HUDI-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3035:
----------------------------------
    Priority: Critical  (was: Blocker)

> Unify Parquet writers
> ---------------------
>
>                 Key: HUDI-3035
>                 URL: https://issues.apache.org/jira/browse/HUDI-3035
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Critical
>             Fix For: 0.11.0
>
>
> Currently we have at least 3 implementations of the Parquet writer (3x as 
> many as we actually need):
> [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieParquetWriter.java]
> [https://github.com/apache/hudi/blob/master/hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowDataParquetWriter.java]
> [https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieInternalRowParquetWriter.java]
>  
> The implementations, while identical in principle, have diverged, each 
> essentially following its own lifecycle.





[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462986#comment-17462986
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Complete log files

 

[^stderr_part1.txt]

[^stderr_part1.txt][^stderr_part2.txt]

> Very slow file listing after enabling metadata for existing tables in 0.10.0 release
> ------------------------------------------------------------------------------------
>
>                 Key: HUDI-3066
>                 URL: https://issues.apache.org/jira/browse/HUDI-3066
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>         Environment: EMR 6.4.0
> Hudi version : 0.10.0
>            Reporter: Harsha Teja Kanna
>            Assignee: Manoj Govindassamy
>            Priority: Major
>              Labels: performance, pull-request-available
>         Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, timeline-1.txt, timeline.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613
>  at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatRead
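
To see which component dominates a log like the one quoted above, it can help to tally messages per logger. The following is a minimal sketch and not part of Hudi: `LoggerTally` and `tally` are hypothetical names, and it assumes the log is available with one entry per line (unwrapped, without the mail quoting shown here).

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts INFO lines per logger name.
final class LoggerTally {
    // Matches "<date> <time>,<millis> INFO <logger>:" at the start of a line.
    private static final Pattern LINE = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3} INFO ([\\w.]+):");

    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                // group(1) is the logger, e.g. "log.AbstractHoodieLogRecordReader"
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms to read  136 instants, 9731 replaced file groups",
            "2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1",
            "2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek policy",
            "2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of remaining logblocks to merge 1");
        tally(sample).forEach((logger, n) -> System.out.println(logger + "\t" + n));
    }
}
```

Lines that do not start with a timestamp (for example wrapped continuations) are skipped, so the counts are per log entry rather than per physical line.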

[GitHub] [hudi] hudi-bot removed a comment on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


hudi-bot removed a comment on pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#issuecomment-998443321


   
   ## CI report:
   
   * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598)
 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
 
   
   
   






[GitHub] [hudi] hudi-bot commented on pull request #4308: [HUDI-3008] Fixing HoodieFileIndex partition column parsing for nested fields

2021-12-20 Thread GitBox


hudi-bot commented on pull request #4308:
URL: https://github.com/apache/hudi/pull/4308#issuecomment-998465363


   
   ## CI report:
   
   * 168fb8f7ef94fceb84c0d4b867e74cca9db908b5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4598)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4620)
 
   
   
   






[jira] [Updated] (HUDI-2989) Hive sync to Glue tables not updating S3 location

2021-12-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2989:
-
Status: In Progress  (was: Open)

> Hive sync to Glue tables not updating S3 location
> -
>
> Key: HUDI-2989
> URL: https://issues.apache.org/jira/browse/HUDI-2989
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.11.0, 0.10.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive

2021-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3082:
--
Status: In Progress  (was: Open)

> [Phase 1] Unify MOR table access across Spark, Hive
> ---
>
> Key: HUDI-3082
> URL: https://issues.apache.org/jira/browse/HUDI-3082
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> This is Phase 1 of what is outlined in HUDI-3081.
>  
> The goals are:
>  * Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, 
> `RealtimeUnmergedRecordReader`)
>  ** _These Readers should differ only in how they handle the payload; 
> everything else should remain constant_
>  * Abstract the following within a common component (name TBD)
>  ** Listing current file-slices at the requested instant (handling the 
> timeline)
>  ** Creating a Record Iterator for the provided file-slice





[jira] [Assigned] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive

2021-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-3082:
-

Assignee: Alexey Kudinkin

> [Phase 1] Unify MOR table access across Spark, Hive
> ---
>
> Key: HUDI-3082
> URL: https://issues.apache.org/jira/browse/HUDI-3082
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
>


[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive

2021-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3082:
--
Priority: Blocker  (was: Major)

> [Phase 1] Unify MOR table access across Spark, Hive
> ---
>
> Key: HUDI-3082
> URL: https://issues.apache.org/jira/browse/HUDI-3082
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>


[jira] [Created] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive

2021-12-20 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-3082:
-

 Summary: [Phase 1] Unify MOR table access across Spark, Hive
 Key: HUDI-3082
 URL: https://issues.apache.org/jira/browse/HUDI-3082
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin


This is Phase 1 of what is outlined in HUDI-3081.

 

The goals are:
 * Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, 
`RealtimeUnmergedRecordReader`)
 ** _These Readers should differ only in how they handle the payload; 
everything else should remain constant_
 * Abstract the following within a common component (name TBD)
 ** Listing current file-slices at the requested instant (handling the timeline)
 ** Creating a Record Iterator for the provided file-slice
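
The two responsibilities of the common component listed above could look roughly like the following. This is only an illustrative sketch under stated assumptions: `ReadPathSketch`, `Slice`, `SliceSource` and `InMemorySliceSource` are hypothetical names that do not exist in Hudi, records are plain strings, and instants are compared as strings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these type names exist in Hudi.
final class ReadPathSketch {

    /** Simplified file slice: a file id, the instant it was written at, and its records. */
    static final class Slice {
        final String fileId;
        final String instant;
        final List<String> records;

        Slice(String fileId, String instant, List<String> records) {
            this.fileId = fileId;
            this.instant = instant;
            this.records = records;
        }
    }

    /** The shared component: engine-specific readers would delegate both steps to it. */
    interface SliceSource {
        List<Slice> slicesAsOf(String instant);

        Iterator<String> recordIterator(Slice slice);
    }

    /** Toy in-memory implementation, present only to make the sketch runnable. */
    static final class InMemorySliceSource implements SliceSource {
        private final List<Slice> all;

        InMemorySliceSource(List<Slice> all) {
            this.all = all;
        }

        @Override
        public List<Slice> slicesAsOf(String instant) {
            // Keep, per file id, the newest slice written at or before the instant.
            Map<String, Slice> latest = new TreeMap<>();
            for (Slice s : all) {
                if (s.instant.compareTo(instant) <= 0) {
                    latest.merge(s.fileId, s,
                        (a, b) -> a.instant.compareTo(b.instant) >= 0 ? a : b);
                }
            }
            return new ArrayList<>(latest.values());
        }

        @Override
        public Iterator<String> recordIterator(Slice slice) {
            return slice.records.iterator();
        }
    }

    public static void main(String[] args) {
        SliceSource source = new InMemorySliceSource(List.of(
            new Slice("f1", "20211216", List.of("a", "b")),
            new Slice("f1", "20211218", List.of("a2", "b")),
            new Slice("f2", "20211217", List.of("c"))));
        for (Slice s : source.slicesAsOf("20211217")) {
            source.recordIterator(s)
                .forEachRemaining(r -> System.out.println(s.fileId + " " + r));
        }
    }
}
```

With such a split, the Hive readers would differ only in what they do with each record from the iterator, which is the constraint stated in the first goal.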





[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive

2021-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3082:
--
Issue Type: Improvement  (was: Bug)

> [Phase 1] Unify MOR table access across Spark, Hive
> ---
>
> Key: HUDI-3082
> URL: https://issues.apache.org/jira/browse/HUDI-3082
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>


[jira] [Updated] (HUDI-3082) [Phase 1] Unify MOR table access across Spark, Hive

2021-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3082:
--
Fix Version/s: 0.11.0

> [Phase 1] Unify MOR table access across Spark, Hive
> ---
>
> Key: HUDI-3082
> URL: https://issues.apache.org/jira/browse/HUDI-3082
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>


[GitHub] [hudi] xushiyan closed pull request #4113: [HUDI-2735] Fix clean rollback archiving logic

2021-12-20 Thread GitBox


xushiyan closed pull request #4113:
URL: https://github.com/apache/hudi/pull/4113


   






[jira] [Updated] (HUDI-3081) [UMBRELLA] Revisiting Read Path Infra across Query Engines

2021-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3081:
--
Status: In Progress  (was: Open)

> [UMBRELLA] Revisiting Read Path Infra across Query Engines
> --
>
> Key: HUDI-3081
> URL: https://issues.apache.org/jira/browse/HUDI-3081
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>
> Currently, our read-path infrastructure is largely separate for each 
> query engine, with the same flow replicated multiple times: 
>  * Hive leverages a hierarchy based on the `InputFormat` class
>  * Spark leverages a hierarchy based on `SnapshotRelation`
> This leads to substantial duplication of virtually identical flows, which 
> are unfortunately now diverging because their lifecycles (bug fixes, etc.) 
> are out of sync.
> h3. Proposal
>  
> *Phase 1: Abstracting Common Functionality*
>  
> _T-shirt size_: 1-1.5 weeks
> _Goal_: Abstract the following common items to avoid duplicating the 
> complex sequences across engines
>  * Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, 
> `RealtimeUnmergedRecordReader`)
>  ** _These Readers should differ only in how they handle the payload; 
> everything else should remain constant_
>  * Abstract the following within a common component (name TBD)
>  ** Listing current file-slices at the requested instant (handling the 
> timeline)
>  ** Creating a Record Iterator for the provided file-slice
>  
> *Phase 2: Revisiting Record Handling*
>  
> _T-shirt size_: 1-1.5 weeks
> _Goal_: Avoid tight coupling with a particular record representation on the 
> read path (currently Avro) and enable
>  * A common record handling API for combining records (Merge API)
>  * Avoiding unnecessary serde by abstracting away standardized record access 
> routines (getting the key, merging, etc.)
>  ** Behind the interface we'd rely on an engine-specific representation to 
> carry the payload (`InternalRow` for Spark, `ArrayWritable` for Hive, etc.)





[jira] [Assigned] (HUDI-3081) [UMBRELLA] Revisiting Read Path Infra across Query Engines

2021-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-3081:
-

Assignee: Alexey Kudinkin

> [UMBRELLA] Revisiting Read Path Infra across Query Engines
> --
>
> Key: HUDI-3081
> URL: https://issues.apache.org/jira/browse/HUDI-3081
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
>

[jira] [Updated] (HUDI-3070) Improve Test

2021-12-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3070:
-
Component/s: Testing

> Improve Test
> 
>
> Key: HUDI-3070
> URL: https://issues.apache.org/jira/browse/HUDI-3070
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Yue Zhang
>Assignee: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0, 0.10.1
>
>
> Improve robustness and stability.





[jira] [Updated] (HUDI-3081) [UMBRELLA] Revisiting Read Path Infra across Query Engines

2021-12-20 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3081:
--
Priority: Blocker  (was: Major)

> [UMBRELLA] Revisiting Read Path Infra across Query Engines
> --
>
> Key: HUDI-3081
> URL: https://issues.apache.org/jira/browse/HUDI-3081
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>


[jira] [Created] (HUDI-3081) [UMBRELLA] Revisiting Read Path Infra across Query Engines

2021-12-20 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-3081:
-

 Summary: [UMBRELLA] Revisiting Read Path Infra across Query Engines
 Key: HUDI-3081
 URL: https://issues.apache.org/jira/browse/HUDI-3081
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin


Currently, our read-path infrastructure is largely separate for each query 
engine, with the same flow replicated multiple times: 
 * Hive leverages a hierarchy based on the `InputFormat` class
 * Spark leverages a hierarchy based on `SnapshotRelation`

This leads to substantial duplication of virtually identical flows, which are 
unfortunately now diverging because their lifecycles (bug fixes, etc.) are 
out of sync.
h3. Proposal
 
*Phase 1: Abstracting Common Functionality*
 
_T-shirt size_: 1-1.5 weeks
_Goal_: Abstract the following common items to avoid duplicating the complex 
sequences across engines
 * Unify Hive’s RecordReaders (`RealtimeCompactedRecordReader`, 
`RealtimeUnmergedRecordReader`)
 ** _These Readers should differ only in how they handle the payload; 
everything else should remain constant_
 * Abstract the following within a common component (name TBD)
 ** Listing current file-slices at the requested instant (handling the timeline)
 ** Creating a Record Iterator for the provided file-slice

 
*Phase 2: Revisiting Record Handling*
 
_T-shirt size_: 1-1.5 weeks
_Goal_: Avoid tight coupling with a particular record representation on the 
read path (currently Avro) and enable
 * A common record handling API for combining records (Merge API)
 * Avoiding unnecessary serde by abstracting away standardized record access 
routines (getting the key, merging, etc.)
 ** Behind the interface we'd rely on an engine-specific representation to carry 
the payload (`InternalRow` for Spark, `ArrayWritable` for Hive, etc.)
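
The Phase 2 idea of a merge API that never converts the engine's own row type might be sketched as follows. All names here (`MergeApiSketch`, `RecordOps`, `mergeByKey`, `Row`, `LATEST_WINS`) are hypothetical and not Hudi APIs; `Row` merely stands in for an engine-specific representation, and "larger ordering value wins" is an assumed merge rule chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and not Hudi APIs.
final class MergeApiSketch {

    /** Record access routines, parameterized by the engine's own row type R. */
    interface RecordOps<R> {
        String keyOf(R row);

        R merge(R older, R newer);
    }

    /** Stand-in for an engine-specific row type. */
    static final class Row {
        final String key;
        final String value;
        final long orderingValue;

        Row(String key, String value, long orderingValue) {
            this.key = key;
            this.value = value;
            this.orderingValue = orderingValue;
        }
    }

    /** Keeps the row with the larger ordering value; ties go to the newer row. */
    static final RecordOps<Row> LATEST_WINS = new RecordOps<Row>() {
        @Override
        public String keyOf(Row row) {
            return row.key;
        }

        @Override
        public Row merge(Row older, Row newer) {
            return newer.orderingValue >= older.orderingValue ? newer : older;
        }
    };

    /** Engine-agnostic merge of base rows with updates; rows stay in type R throughout. */
    static <R> List<R> mergeByKey(List<R> base, List<R> updates, RecordOps<R> ops) {
        Map<String, R> byKey = new LinkedHashMap<>();
        for (R row : base) {
            byKey.put(ops.keyOf(row), row);
        }
        for (R row : updates) {
            // Existing key: combine via ops.merge(older, newer); new key: insert as-is.
            byKey.merge(ops.keyOf(row), row, ops::merge);
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<Row> merged = mergeByKey(
            List.of(new Row("k1", "old", 1L), new Row("k2", "keep", 1L)),
            List.of(new Row("k1", "new", 2L)),
            LATEST_WINS);
        for (Row r : merged) {
            System.out.println(r.key + "=" + r.value);
        }
    }
}
```

Because `mergeByKey` only touches rows through `RecordOps`, each engine could supply its own implementation over its native row type without an intermediate serialization step.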





[jira] [Updated] (HUDI-3070) Improve Test

2021-12-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3070:
-
Fix Version/s: 0.11.0
   0.10.1

> Improve Test
> 
>
> Key: HUDI-3070
> URL: https://issues.apache.org/jira/browse/HUDI-3070
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yue Zhang
>Assignee: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0, 0.10.1
>
>
> Improve robustness and stability.





[jira] [Closed] (HUDI-3070) Improve Test

2021-12-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3070.

  Assignee: Yue Zhang
Resolution: Done

> Improve Test
> 
>
> Key: HUDI-3070
> URL: https://issues.apache.org/jira/browse/HUDI-3070
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yue Zhang
>Assignee: Yue Zhang
>Priority: Major
>  Labels: pull-request-available
>
> Improve robustness and stability.





[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462981#comment-17462981
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Metadata on reader side disabled

 

!Screen Shot 2021-12-20 at 10.17.44 PM.png!

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, timeline-1.txt, timeline.txt
>
>

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: Screen Shot 2021-12-20 at 10.17.44 PM-1.png

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: Manoj Govindassamy
>Priority: Major
>  Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM-1.png, 
> Screen Shot 2021-12-20 at 10.17.44 PM.png, metadata_timeline.txt, 
> metadata_timeline_archived.txt, timeline-1.txt, timeline.txt
>
>

[jira] [Updated] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2021-12-20 Thread Harsha Teja Kanna (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsha Teja Kanna updated HUDI-3066:

Attachment: Screen Shot 2021-12-20 at 10.17.44 PM.png

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
> Reporter: Harsha Teja Kanna
> Assignee: Manoj Govindassamy
> Priority: Major
> Labels: performance, pull-request-available
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> metadata_timeline.txt, metadata_timeline_archived.txt, timeline-1.txt, 
> timeline.txt
>
>
> After the metadata table is enabled, file listing takes a long time.
> If metadata is also enabled on the reader side (as shown below), each file
> listing task takes even longer.
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> // Enable the metadata table on the reader side through the Spark session conf.
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> // Snapshot query over the entrydate=2021 partitions.
> val sessions = spark
>   .read
>   .format("org.apache.hudi")
>   .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
>   .option(DataSourceReadOptions.READ_PATHS.key(), s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
>   .load()
> sessions.createOrReplaceTempView("sessions")
> {code}
> Existing tables (COW) have inline clustering enabled and many replace commits.
> The logs suggest the delay is in the resetFileGroupsReplaced function of
> view.AbstractTableFileSystemView or in metadata.HoodieBackedTableMetadata.
> There are also many log messages from AbstractHoodieLogRecordReader.
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,110 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.77_0-35-590',
>  fileLen=0}
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613
>  at instant 20211216183448389
> 2021-12-18 23:37:46,112 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,118 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.fil

[jira] [Updated] (HUDI-735) Improve deltastreamer error message when case mismatch of commandline arguments.

2021-12-20 Thread Harshal Patil (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harshal Patil updated HUDI-735:
---
Status: Patch Available  (was: In Progress)

> Improve deltastreamer error message when case mismatch of commandline 
> arguments.
> 
>
> Key: HUDI-735
> URL: https://issues.apache.org/jira/browse/HUDI-735
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Code Cleanup, DeltaStreamer, Usability
> Reporter: Vinoth Chandar
> Assignee: Harshal Patil
> Priority: Major
> Labels: core-flow-ds, pull-request-available, sev:normal, 
> user-support-issues
>
> Team,
> When following the blog "Change Capture Using AWS Database Migration
> Service and Hudi" with my own data set, the initial load works perfectly.
> When issuing the command with the DMS CDC files on S3, I get the following
> error:
> {code}
> 20/03/24 17:56:28 ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Please provide a valid schema provider class!
>  at org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)
>  at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)
>  at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)
> {code}
> I tried passing --schemaprovider-class
> org.apache.hudi.utilities.schema.FilebasedSchemaProvider.Source and providing
> the schema. The error no longer occurs, but nothing is written to Hudi.
> I am not performing any transformations (other than the DMS transform) and
> am using the default record key strategy.
> If the team has any pointers, please let me know.
> Thank you!
> ---
> Thank you Vinoth. I was able to find the issue. All my column names were in
> upper case. I switched the column names and table names to lower case, and
> it now works perfectly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

