[GitHub] [hudi] bvaradar closed issue #2432: [SUPPORT] write hudi data failed when using Deltastreamer

2021-01-21 Thread GitBox


bvaradar closed issue #2432:
URL: https://github.com/apache/hudi/issues/2432


   







[GitHub] [hudi] bvaradar commented on issue #2432: [SUPPORT] write hudi data failed when using Deltastreamer

2021-01-21 Thread GitBox


bvaradar commented on issue #2432:
URL: https://github.com/apache/hudi/issues/2432#issuecomment-765213298


   Closing due to inactivity!







[GitHub] [hudi] bvaradar commented on issue #2423: Performance Issues due to significant Parallel Create-Dir being issued to Azure ADLS_V2

2021-01-21 Thread GitBox


bvaradar commented on issue #2423:
URL: https://github.com/apache/hudi/issues/2423#issuecomment-765213042


   @nsivabalan: Can you open a PR with the code changes in https://github.com/apache/hudi/issues/2423#issuecomment-758433327 so they can be landed?







[GitHub] [hudi] bvaradar commented on issue #2397: [SUPPORT]

2021-01-21 Thread GitBox


bvaradar commented on issue #2397:
URL: https://github.com/apache/hudi/issues/2397#issuecomment-765212048


   Please reopen if this is still an issue.







[GitHub] [hudi] bvaradar closed issue #2397: [SUPPORT]

2021-01-21 Thread GitBox


bvaradar closed issue #2397:
URL: https://github.com/apache/hudi/issues/2397


   







[GitHub] [hudi] bvaradar commented on issue #2448: [SUPPORT] deltacommit for client 172.16.116.102 already exists

2021-01-21 Thread GitBox


bvaradar commented on issue #2448:
URL: https://github.com/apache/hudi/issues/2448#issuecomment-765211520


   @peng-xin: Can you set `hoodie.compact.inline` to `true` and `hoodie.auto.commit` to `true`? The log files are growing because they need to be compacted; with inline compaction enabled, compactions will run periodically, and the cleaner will eventually remove the old log files and parquet files after that.
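   As an illustration of the suggestion above, here is a minimal sketch (not from the thread) of passing both configs to a Spark datasource write in Java; the table name, DataFrame, and base path are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class InlineCompactionExample {

  // Upsert `df` into a merge-on-read table with inline compaction and
  // auto-commit enabled, so log files are periodically compacted and
  // later removed by the cleaner.
  static void upsertWithInlineCompaction(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "my_table")               // assumed table name
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.compact.inline", "true")               // config from the comment above
        .option("hoodie.auto.commit", "true")                  // config from the comment above
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```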







[GitHub] [hudi] bvaradar commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

2021-01-21 Thread GitBox


bvaradar commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-765206339


   @rubenssoto: The time is being spent shuffling the data to route records for writing to the parquet files, plus the write itself. I think it would be better if you increase the number of executors (and reduce the cores per executor to 2) and try again.
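   For reference, a minimal sketch (assumed values, not from the thread) of that executor layout expressed as Spark session configs in Java:

```java
import org.apache.spark.sql.SparkSession;

public class ExecutorLayoutExample {
  public static void main(String[] args) {
    // Hypothetical sizing per the suggestion above: more, smaller executors
    // with 2 cores each; the instance count and memory are assumptions to tune.
    SparkSession spark = SparkSession.builder()
        .appName("hudi-upsert-tuning")
        .config("spark.executor.instances", "40") // assumed
        .config("spark.executor.cores", "2")      // 2 cores per executor, as suggested
        .config("spark.executor.memory", "8g")    // assumed
        .getOrCreate();
    spark.stop();
  }
}
```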







[GitHub] [hudi] vinothchandar commented on issue #2461: All records are present in athena query result on glue crawled Hudi tables

2021-01-21 Thread GitBox


vinothchandar commented on issue #2461:
URL: https://github.com/apache/hudi/issues/2461#issuecomment-765188290


   @vrtrepp you could potentially modify your existing Glue step to run the hive-sync tool as an additional step? At a high level, without `HoodieParquetInputFormat` as the registered input format, it's impossible for Athena to do the filtering necessary to give you just the latest records.
   Just saw this today: https://aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/ , not sure if it's directly helpful for you.
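   For context, a hedged sketch of invoking the hive-sync tool programmatically (Java, roughly as the API looked in Hudi 0.6/0.7; all connection values here are made-up assumptions). Syncing registers the table in the metastore with `HoodieParquetInputFormat`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hudi.hive.HiveSyncConfig;
import org.apache.hudi.hive.HiveSyncTool;

public class HiveSyncExample {
  public static void main(String[] args) throws Exception {
    HiveSyncConfig cfg = new HiveSyncConfig();
    cfg.databaseName = "default";                                    // assumed
    cfg.tableName = "my_hudi_table";                                 // assumed
    cfg.basePath = "s3://bucket/path/to/table";                      // assumed
    cfg.jdbcUrl = "jdbc:hive2://hive-server:10000";                  // assumed
    cfg.partitionFields = java.util.Collections.singletonList("dt"); // assumed

    Configuration hadoopConf = new Configuration();
    FileSystem fs = FileSystem.get(hadoopConf);
    HiveConf hiveConf = new HiveConf(hadoopConf, HiveSyncExample.class);
    // Sync the table definition (schema, partitions, input format) to Hive.
    new HiveSyncTool(cfg, hiveConf, fs).syncHoodieTable();
  }
}
```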







[GitHub] [hudi] yanghua merged pull request #2459: [MINOR] Improve code readability,remove the continue keyword

2021-01-21 Thread GitBox


yanghua merged pull request #2459:
URL: https://github.com/apache/hudi/pull/2459


   







[GitHub] [hudi] yanghua commented on pull request #2450: [HUDI-1538] Try to init class trying different signatures instead of checking its name.

2021-01-21 Thread GitBox


yanghua commented on pull request #2450:
URL: https://github.com/apache/hudi/pull/2450#issuecomment-765144294


   @vburenin Would you please rebase onto the master branch and see if Travis is OK?







[GitHub] [hudi] yanghua merged pull request #2471: [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFli…

2021-01-21 Thread GitBox


yanghua merged pull request #2471:
URL: https://github.com/apache/hudi/pull/2471


   







[GitHub] [hudi] codecov-io edited a comment on pull request #2471: [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFli…

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2471:
URL: https://github.com/apache/hudi/pull/2471#issuecomment-765116001


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=h1) Report
   > Merging [#2471](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=desc) (ef62c24) into [master](https://codecov.io/gh/apache/hudi/commit/976420c49a3fe764f7cecd30b6cdb32861be5537?el=desc) (976420c) will **increase** coverage by `19.20%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2471/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2471       +/-  ##
   ============================================
   + Coverage     50.27%   69.48%    +19.20%
   + Complexity     3050      358      -2692
   ============================================
     Files           419       53       -366
     Lines         18897     1930     -16967
     Branches       1937      230      -1707
   ============================================
   - Hits           9500     1341      -8159
   + Misses         8622      456      -8166
   + Partials        775      133       -642
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...common/table/view/FileSystemViewStorageConfig.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvRmlsZVN5c3RlbVZpZXdTdG9yYWdlQ29uZmlnLmphdmE=)
 | | | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | | | |
   | 
[...udi/common/table/timeline/dto/FSPermissionDTO.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9GU1Blcm1pc3Npb25EVE8uamF2YQ==)
 | | | |
   | 
[...rg/apache/hudi/common/util/SerializationUtils.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvU2VyaWFsaXphdGlvblV0aWxzLmphdmE=)
 | | | |
   | 
[...adoop/realtime/HoodieHFileRealtimeInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL0hvb2RpZUhGaWxlUmVhbHRpbWVJbnB1dEZvcm1hdC5qYXZh)
 | | | |
   | 
[...hudi/hadoop/hive/HoodieCombineHiveInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL2hpdmUvSG9vZGllQ29tYmluZUhpdmVJbnB1dEZvcm1hdC5qYXZh)
 | | | |
   | 
[...di/timeline/service/handlers/FileSliceHandler.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvRmlsZVNsaWNlSGFuZGxlci5qYXZh)
 | | | |
   | 
[...ain/scala/org/apache/hudi/cli/DedupeSparkJob.scala](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL2NsaS9EZWR1cGVTcGFya0pvYi5zY2FsYQ==)
 | | | |
   | 
[...rg/apache/hudi/metadata/MetadataPartitionType.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvTWV0YWRhdGFQYXJ0aXRpb25UeXBlLmphdmE=)
 | | | |
   | 
[...sioning/clean/CleanMetadataV1MigrationHandler.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5NZXRhZGF0YVYxTWlncmF0aW9uSGFuZGxlci5qYXZh)
 | | | |
   | ... and [355 
more](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree-more) | |
   




[GitHub] [hudi] leesf commented on pull request #2447: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-01-21 Thread GitBox


leesf commented on pull request #2447:
URL: https://github.com/apache/hudi/pull/2447#issuecomment-765010589


   > @leesf Hello, I have a doubt now. I did not modify the code of `hudi-integ-test`, but every time the check fails because of it. Do you know why?
   
   Would you please rebase onto master, since there is a flaky-test fix on the master branch.







[GitHub] [hudi] garyli1019 commented on pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

2021-01-21 Thread GitBox


garyli1019 commented on pull request #2412:
URL: https://github.com/apache/hudi/pull/2412#issuecomment-764705432


   Should be OK to merge since we cut the release.







[GitHub] [hudi] yanghua merged pull request #2434: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-21 Thread GitBox


yanghua merged pull request #2434:
URL: https://github.com/apache/hudi/pull/2434


   







[GitHub] [hudi] jiangjiguang commented on issue #143: Tracking ticket for folks to be added to slack group

2021-01-21 Thread GitBox


jiangjiguang commented on issue #143:
URL: https://github.com/apache/hudi/issues/143#issuecomment-764596209


   Hi, could you add me to the Slack group? My email is jiangjiguang...@163.com
   
   







[GitHub] [hudi] codecov-io edited a comment on pull request #2388: [HUDI-1353] add incremental timeline support for pending clustering ops

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2388:
URL: https://github.com/apache/hudi/pull/2388#issuecomment-751907604


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2388?src=pr=h1) Report
   > Merging [#2388](https://codecov.io/gh/apache/hudi/pull/2388?src=pr=desc) (4423c0c) into [master](https://codecov.io/gh/apache/hudi/commit/81ccb0c71ad2c17c5613698d4fb50f3b49b21fb4?el=desc) (81ccb0c) will **increase** coverage by `19.15%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2388/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2388?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2388       +/-  ##
   ============================================
   + Coverage     50.27%   69.43%    +19.15%
   + Complexity     3050      357      -2693
   ============================================
     Files           419       53       -366
     Lines         18897     1930     -16967
     Branches       1937      230      -1707
   ============================================
   - Hits           9500     1340      -8160
   + Misses         8622      456      -8166
   + Partials        775      134       -641
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2388?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../java/org/apache/hudi/common/util/CommitUtils.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQ29tbWl0VXRpbHMuamF2YQ==)
 | | | |
   | 
[...rg/apache/hudi/common/bloom/SimpleBloomFilter.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL1NpbXBsZUJsb29tRmlsdGVyLmphdmE=)
 | | | |
   | 
[...apache/hudi/common/model/HoodieCommitMetadata.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUNvbW1pdE1ldGFkYXRhLmphdmE=)
 | | | |
   | 
[...pache/hudi/io/storage/HoodieFileReaderFactory.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vc3RvcmFnZS9Ib29kaWVGaWxlUmVhZGVyRmFjdG9yeS5qYXZh)
 | | | |
   | 
[...i/common/table/log/block/HoodieHFileDataBlock.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVIRmlsZURhdGFCbG9jay5qYXZh)
 | | | |
   | 
[...hudi/common/config/DFSPropertiesConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9ERlNQcm9wZXJ0aWVzQ29uZmlndXJhdGlvbi5qYXZh)
 | | | |
   | 
[.../org/apache/hudi/hadoop/utils/HoodieHiveUtils.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZUhpdmVVdGlscy5qYXZh)
 | | | |
   | 
[...e/hudi/common/model/HoodieRollingStatMetadata.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJvbGxpbmdTdGF0TWV0YWRhdGEuamF2YQ==)
 | | | |
   | 
[...pache/hudi/common/model/HoodieArchivedLogFile.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUFyY2hpdmVkTG9nRmlsZS5qYXZh)
 | | | |
   | 
[.../hive/SlashEncodedHourPartitionValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvU2xhc2hFbmNvZGVkSG91clBhcnRpdGlvblZhbHVlRXh0cmFjdG9yLmphdmE=)
 | | | |
   | ... and [346 
more](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree-more) | |
   







[GitHub] [hudi] vinothchandar commented on pull request #2468: [MINOR] Disabling problematic tests temporarily to stabilize CI

2021-01-21 Thread GitBox


vinothchandar commented on pull request #2468:
URL: https://github.com/apache/hudi/pull/2468#issuecomment-763990996


   https://travis-ci.com/github/vinothchandar/hudi/builds/213836867
   Build passes here (ran on my fork, since the Apache queue is overloaded now). Landing.







[GitHub] [hudi] codecov-io edited a comment on pull request #2426: [HUDI-304] Configure spotless and java style

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2426:
URL: https://github.com/apache/hudi/pull/2426#issuecomment-757443274


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2426?src=pr=h1) Report
   > Merging [#2426](https://codecov.io/gh/apache/hudi/pull/2426?src=pr=desc) (9cb1268) into [master](https://codecov.io/gh/apache/hudi/commit/e926c1a45ca95fa1911f6f88a0577554f2797760?el=desc) (e926c1a) will **decrease** coverage by `0.20%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2426/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2426?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2426      +/-  ##
   ==========================================
   - Coverage     50.73%   50.52%   -0.21%
   + Complexity     3064     3032      -32
   ==========================================
     Files           419      417       -2
     Lines         18797    18725      -72
     Branches       1922     1917       -5
   ==========================================
   - Hits           9536     9461      -75
   - Misses         8485     8489       +4
   + Partials        776      775       -1
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.28% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.61% <ø> (-0.41%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `10.20% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.00% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `66.07% <ø> (+0.16%)` | `0.00 <ø> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.84% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.41% <ø> (-0.02%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2426?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/common/engine/TaskContextSupplier.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9UYXNrQ29udGV4dFN1cHBsaWVyLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...e/hudi/metadata/FileSystemBackedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvRmlsZVN5c3RlbUJhY2tlZFRhYmxlTWV0YWRhdGEuamF2YQ==)
 | `0.00% <0.00%> (-94.60%)` | `0.00% <0.00%> (-13.00%)` | |
   | 
[...apache/hudi/common/engine/HoodieEngineContext.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9Ib29kaWVFbmdpbmVDb250ZXh0LmphdmE=)
 | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...g/apache/hudi/common/function/FunctionWrapper.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Z1bmN0aW9uL0Z1bmN0aW9uV3JhcHBlci5qYXZh)
 | `0.00% <0.00%> (-10.53%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `15.00% <0.00%> (-1.00%)` | |
   | 
[...c/main/java/org/apache/hudi/common/fs/FSUtils.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0ZTVXRpbHMuamF2YQ==)
 | `45.71% <0.00%> (-6.18%)` | `55.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh)
 | `68.96% <0.00%> (-1.73%)` | `43.00% <0.00%> (-2.00%)` | |
   | 
[...i/common/table/timeline/TimelineMetadataUtils.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL1RpbWVsaW5lTWV0YWRhdGFVdGlscy5qYXZh)
 | `71.69% <0.00%> (-1.03%)` | `17.00% <0.00%> (ø%)` | |
   | 
[...rg/apache/hudi/hadoop/HoodieROTablePathFilter.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZVJPVGFibGVQYXRoRmlsdGVyLmphdmE=)
 | `63.15% <0.00%> (-0.48%)` | `12.00% <0.00%> (ø%)` | |
   | 

[GitHub] [hudi] bvaradar commented on issue #2464: [SUPPORT] ExecutorLostFailure - Try Processing 1TB Of Data

2021-01-21 Thread GitBox


bvaradar commented on issue #2464:
URL: https://github.com/apache/hudi/issues/2464#issuecomment-763888673


   @rubenssoto: This would need trial and error. Can you try with 6GB+ and see if the error goes away?
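   A minimal sketch (assumed values) of bumping executor memory for a retry; raising the off-heap overhead is an extra assumption here, not advice from the thread, but it often matters for ExecutorLostFailure:

```java
import org.apache.spark.sql.SparkSession;

public class ExecutorMemoryExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-bulk-load")
        .config("spark.executor.memory", "6g")         // 6GB+, per the suggestion above
        .config("spark.executor.memoryOverhead", "1g") // assumed extra headroom
        .getOrCreate();
    spark.stop();
  }
}
```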







[GitHub] [hudi] rubenssoto commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

2021-01-21 Thread GitBox


rubenssoto commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-764683609


   Hi @bvaradar,
   
   Could you help me with this? :)
   
   Thank you!







[GitHub] [hudi] jessica0530 commented on issue #143: Tracking ticket for folks to be added to slack group

2021-01-21 Thread GitBox


jessica0530 commented on issue #143:
URL: https://github.com/apache/hudi/issues/143#issuecomment-764415027


   Please add me: wjxdtc10...@gmail.com. Thanks!







[GitHub] [hudi] teeyog commented on pull request #2447: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-01-21 Thread GitBox


teeyog commented on pull request #2447:
URL: https://github.com/apache/hudi/pull/2447#issuecomment-764544455


   @leesf Hello, I have a doubt now. I did not modify the code of ```hudi-integ-test```, but every time the check fails because of it. Do you know why?







[GitHub] [hudi] Trevor-zhang commented on a change in pull request #2449: [HUDI-1528] hudi-sync-tools supports synchronization to remote hive

2021-01-21 Thread GitBox


Trevor-zhang commented on a change in pull request #2449:
URL: https://github.com/apache/hudi/pull/2449#discussion_r561621697



##
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -284,6 +284,9 @@ public static HiveSyncConfig buildHiveSyncConfig(TypedProperties props, String b
 props.getString(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_PASS_OPT_VAL());
 hiveSyncConfig.jdbcUrl =
 props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_URL_OPT_VAL());
+if (hiveSyncConfig.hiveMetaStoreUri != null) {
+  hiveSyncConfig.hiveMetaStoreUri = props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_METASTORE_URI_OPT_VAL());
+}

Review comment:
   Because this is not a required option.

##
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -284,6 +284,9 @@ public static HiveSyncConfig buildHiveSyncConfig(TypedProperties props, String b
 props.getString(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_PASS_OPT_VAL());
 hiveSyncConfig.jdbcUrl =
 props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_URL_OPT_VAL());
+if (hiveSyncConfig.hiveMetaStoreUri != null) {
+  hiveSyncConfig.hiveMetaStoreUri = props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_METASTORE_URI_OPT_VAL());
+}

Review comment:
   (1) When synchronizing hudi data with a local Hive, `hiveSyncConfig.hiveMetaStoreUri` can be set to null.
   (2) There is no priority distinction between `hiveSyncConfig` and `props`: since `hiveSyncConfig` has no props attribute, only a single attribute is set.
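   For readers following along, a hedged Java sketch of configuring a remote-metastore sync with the field this PR introduces. Only the `hiveMetaStoreUri` field comes from the diff above; every value and the other field assignments are made-up assumptions for illustration:

```java
import org.apache.hudi.hive.HiveSyncConfig;

public class RemoteHiveSyncConfigExample {
  public static void main(String[] args) {
    HiveSyncConfig cfg = new HiveSyncConfig();
    cfg.databaseName = "default";                        // assumed
    cfg.tableName = "my_hudi_table";                     // assumed
    cfg.basePath = "hdfs:///tmp/my_hudi_table";          // assumed
    cfg.jdbcUrl = "jdbc:hive2://remote-hive:10000";      // assumed
    cfg.hiveMetaStoreUri = "thrift://remote-hive:9083";  // field added by this PR
    System.out.println("Syncing " + cfg.tableName + " to " + cfg.hiveMetaStoreUri);
  }
}
```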









[GitHub] [hudi] xushiyan commented on pull request #2454: [HUDI-393] Set up CI with Azure Pipelines

2021-01-21 Thread GitBox


xushiyan commented on pull request #2454:
URL: https://github.com/apache/hudi/pull/2454#issuecomment-764489544


   Auto-closed due to a force update on my personal repo... will re-create the PR once I've tried out some settings.







[GitHub] [hudi] vinothchandar commented on pull request #2434: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-21 Thread GitBox


vinothchandar commented on pull request #2434:
URL: https://github.com/apache/hudi/pull/2434#issuecomment-764820382


   @yanghua should be good to land now if you are happy with it







[GitHub] [hudi] garyli1019 merged pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

2021-01-21 Thread GitBox


garyli1019 merged pull request #2412:
URL: https://github.com/apache/hudi/pull/2412


   







[GitHub] [hudi] codecov-io edited a comment on pull request #2459: [MINOR] Improve code readability,remove the continue keyword

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2459:
URL: https://github.com/apache/hudi/pull/2459#issuecomment-763007666











[GitHub] [hudi] rubenssoto commented on issue #2464: [SUPPORT] ExecutorLostFailure - Try Processing 1TB Of Data

2021-01-21 Thread GitBox


rubenssoto commented on issue #2464:
URL: https://github.com/apache/hudi/issues/2464#issuecomment-764682521


   Thank you so much for your help @bvaradar, it worked.







[GitHub] [hudi] codecov-io commented on pull request #2471: [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFli…

2021-01-21 Thread GitBox


codecov-io commented on pull request #2471:
URL: https://github.com/apache/hudi/pull/2471#issuecomment-765116001


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=h1) Report
   > Merging [#2471](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=desc) (ef62c24) into [master](https://codecov.io/gh/apache/hudi/commit/976420c49a3fe764f7cecd30b6cdb32861be5537?el=desc) (976420c) will **decrease** coverage by `40.58%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2471/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master   #2471       +/-  ##
   ==========================================
   - Coverage     50.27%   9.68%   -40.59%
   + Complexity     3050      48     -3002
   ==========================================
     Files           419      53      -366
     Lines         18897    1930    -16967
     Branches       1937     230     -1707
   ==========================================
   - Hits           9500     187     -9313
   + Misses         8622    1730     -6892
   + Partials        775      13      -762
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[GitHub] [hudi] wangxianghu commented on a change in pull request #2375: [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client

2021-01-21 Thread GitBox


wangxianghu commented on a change in pull request #2375:
URL: https://github.com/apache/hudi/pull/2375#discussion_r562308054



##
File path: hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/bloom/HoodieFlinkBloomIndexCheckFunction.java
##
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.client.utils.LazyIterableIterator;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.io.HoodieKeyLookupHandle;
+import org.apache.hudi.io.HoodieKeyLookupHandle.KeyLookupResult;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.function.Function;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+import scala.Tuple2;
+
+/**
+ * Function performing the actual checking of a List partition containing (fileId, hoodieKeys) against the actual files.
+ */
+//TODO we can move this class into the hudi-client-common and reuse it for spark client
+public class HoodieFlinkBloomIndexCheckFunction
+    implements Function<Iterator<Tuple2<String, HoodieKey>>, Iterator<List<KeyLookupResult>>> {
+
+  private final HoodieTable hoodieTable;
+
+  private final HoodieWriteConfig config;
+
+  public HoodieFlinkBloomIndexCheckFunction(HoodieTable hoodieTable, HoodieWriteConfig config) {
+    this.hoodieTable = hoodieTable;
+    this.config = config;
+  }
+
+  @Override
+  public Iterator<List<KeyLookupResult>> apply(Iterator<Tuple2<String, HoodieKey>> fileParitionRecordKeyTripletItr) {
+    return new LazyKeyCheckIterator(fileParitionRecordKeyTripletItr);
+  }
+
+  @Override
+  public <V> Function<V, Iterator<List<KeyLookupResult>>> compose(Function<? super V, ? extends Iterator<Tuple2<String, HoodieKey>>> before) {
+    return null;
+  }
+
+  @Override
+  public <V> Function<Iterator<Tuple2<String, HoodieKey>>, V> andThen(Function<? super Iterator<List<KeyLookupResult>>, ? extends V> after) {
+    return null;
+  }
+
+  class LazyKeyCheckIterator extends LazyIterableIterator<Tuple2<String, HoodieKey>, List<KeyLookupResult>> {

Review comment:
   > maybe we can move HoodieFlinkBloomIndexCheckFunction into the hudi-client-common later then spark can reuse it.
   
   Yes, that could be another PR.









[GitHub] [hudi] danny0405 commented on a change in pull request #2449: [HUDI-1528] hudi-sync-tools supports synchronization to remote hive

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2449:
URL: https://github.com/apache/hudi/pull/2449#discussion_r561587235



##
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -284,6 +284,9 @@ public static HiveSyncConfig buildHiveSyncConfig(TypedProperties props, String b
 props.getString(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_PASS_OPT_VAL());
 hiveSyncConfig.jdbcUrl =
 props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_URL_OPT_VAL());
+if (hiveSyncConfig.hiveMetaStoreUri != null) {
+  hiveSyncConfig.hiveMetaStoreUri = props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_METASTORE_URI_OPT_VAL());
+}

Review comment:
   What is the purpose of the check `hiveSyncConfig.hiveMetaStoreUri != null`?

##
File path: hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -284,6 +284,9 @@ public static HiveSyncConfig buildHiveSyncConfig(TypedProperties props, String b
 props.getString(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_PASS_OPT_VAL());
 hiveSyncConfig.jdbcUrl =
 props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_URL_OPT_VAL());
+if (hiveSyncConfig.hiveMetaStoreUri != null) {
+  hiveSyncConfig.hiveMetaStoreUri = props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_METASTORE_URI_OPT_VAL());
+}

Review comment:
   Looks weird; in which case can `hiveSyncConfig.hiveMetaStoreUri` be null, and what is the priority between `hiveSyncConfig` and `props`?









[GitHub] [hudi] codecov-io edited a comment on pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#issuecomment-757736411











[GitHub] [hudi] codecov-io edited a comment on pull request #2325: [HUDI-699]Fix CompactionCommand and add unit test for CompactionCommand

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2325:
URL: https://github.com/apache/hudi/pull/2325#issuecomment-742860619











[GitHub] [hudi] vinothchandar merged pull request #2469: [MINOR] Make a separate travis CI job for hudi-utilities

2021-01-21 Thread GitBox


vinothchandar merged pull request #2469:
URL: https://github.com/apache/hudi/pull/2469


   







[GitHub] [hudi] yanghua commented on pull request #2434: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-21 Thread GitBox


yanghua commented on pull request #2434:
URL: https://github.com/apache/hudi/pull/2434#issuecomment-765047901


   > @yanghua should be good to land now if you are happy with it
   
   ack, thanks.







[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r561768669



##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteFunction.java
##
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.client.FlinkTaskContextSupplier;
+import org.apache.hudi.client.HoodieFlinkWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.util.ObjectSizeCalculator;
+import org.apache.hudi.keygen.KeyGenerator;
+import org.apache.hudi.operator.event.BatchWriteSuccessEvent;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.flink.annotation.VisibleForTesting;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.formats.avro.RowDataToAvroConverters;
+import org.apache.flink.runtime.operators.coordination.OperatorEventGateway;
+import org.apache.flink.runtime.state.FunctionInitializationContext;
+import org.apache.flink.runtime.state.FunctionSnapshotContext;
+import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
+import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.util.Collector;
+import org.apache.flink.util.Preconditions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.locks.Condition;
+import java.util.concurrent.locks.ReentrantLock;
+import java.util.function.BiFunction;
+
+/**
+ * Sink function to write the data to the underneath filesystem.
+ *
+ * <p>Work Flow
+ *
+ * <p>The function firstly buffers the data as a batch of {@link HoodieRecord}s.
+ * It flushes (writes) the record batch when a Flink checkpoint starts. After a batch has been written successfully,
+ * the function notifies its operator coordinator {@link StreamWriteOperatorCoordinator} to mark a successful write.
+ *
+ * <p>Exactly-once Semantics
+ *
+ * <p>The task implements exactly-once semantics by buffering the data between checkpoints. The operator coordinator
+ * starts a new instant on the timeline when a checkpoint triggers; the coordinator checkpoints always
+ * start before its operator, so when this function starts a checkpoint, a REQUESTED instant already exists.
+ * The function's processing thread then blocks data buffering while the checkpoint thread flushes the existing data buffer.
+ * When the existing data buffer is written successfully, the processing thread unblocks and starts buffering again for the next checkpoint round.
+ * Because any checkpoint failure triggers a write rollback, this implements exactly-once semantics.
+ *
+ * <p>Fault Tolerance
+ *
+ * <p>The operator coordinator checks the validity of the last instant when it starts a new one. The operator rolls back
+ * the written data and throws when any error occurs. This means any checkpoint or task failure triggers a failover.
+ * The operator coordinator retries several times when committing the write status.
+ *
+ * <p>Note: The function task requires the input stream to be partitioned by the partition fields, to avoid different write tasks
+ * writing to the same file group and conflicting. The general case for the partition path is a datetime field,
+ * so the sink task is very likely to have an IO bottleneck; the more flexible solution is to shuffle the
+ * data by the file group IDs.
+ *
+ * @param <I> Type of the input record
+ * @see StreamWriteOperatorCoordinator
+ */
+public class StreamWriteFunction<K, I, O> extends KeyedProcessFunction<K, I, O> implements CheckpointedFunction {
+
+  private static 
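   To make the checkpoint-buffering flow described in the javadoc above concrete, here is a minimal, hedged sketch of the same pattern using public Flink APIs. It is an illustration only, not the PR's actual implementation, and the flush target is a placeholder:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Buffers records between checkpoints and flushes the batch when a
// checkpoint triggers, mirroring the flow described above.
public class BufferingSinkSketch<T> extends RichSinkFunction<T> implements CheckpointedFunction {

  private transient List<T> buffer;

  @Override
  public void invoke(T value, Context context) {
    buffer.add(value); // buffer until the next checkpoint
  }

  @Override
  public void snapshotState(FunctionSnapshotContext context) {
    flush(buffer);     // write the batch when the checkpoint starts
    buffer.clear();
  }

  @Override
  public void initializeState(FunctionInitializationContext context) {
    buffer = new ArrayList<>(); // state restore elided in this sketch
  }

  private void flush(List<T> batch) {
    // hypothetical write of the batch to storage
  }
}
```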

[GitHub] [hudi] vinothchandar commented on pull request #2469: [MINOR] Make a separate travis CI job for hudi-utilities

2021-01-21 Thread GitBox


vinothchandar commented on pull request #2469:
URL: https://github.com/apache/hudi/pull/2469#issuecomment-764395733











[GitHub] [hudi] codecov-io edited a comment on pull request #2447: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2447:
URL: https://github.com/apache/hudi/pull/2447#issuecomment-760949326











[GitHub] [hudi] teeyog commented on a change in pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-21 Thread GitBox


teeyog commented on a change in pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#discussion_r561454150



##
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -46,6 +46,11 @@ class DefaultSource extends RelationProvider
   with StreamSinkProvider
   with Serializable {
 
+  SparkSession.getActiveSession.foreach { spark =>
+    // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
+    spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")

Review comment:
   Thank you for your review; a TODO has been added.
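   For illustration, a hedged Java sketch of the usage this change enables; the table name and partition column are assumptions. With the legacy flag set by `DefaultSource` (see the diff above), `partitionBy(...)` is forwarded to Hudi as `hoodie.datasource.write.partitionpath.field`:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class PartitionByExample {

  // partitionBy("dt") reaches Hudi as
  // hoodie.datasource.write.partitionpath.field = "dt".
  static void write(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "my_table") // assumed
        .partitionBy("dt")                       // hypothetical partition column
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```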









[GitHub] [hudi] wangxianghu commented on pull request #2471: [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFli…

2021-01-21 Thread GitBox


wangxianghu commented on pull request #2471:
URL: https://github.com/apache/hudi/pull/2471#issuecomment-765067817


   @yanghua @loukey-lj please take a look when you are free







[hudi] branch master updated (748dcc9 -> 048633d)

2021-01-21 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 748dcc9  [MINOR] Remove InstantGeneratorOperator parallelism limit in 
HoodieFlinkStreamer and update docs (#2471)
 add 048633d  [MINOR] Improve code readability,remove the continue keyword 
(#2459)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/integ/testsuite/reader/DFSDeltaInputReader.java | 1 -
 1 file changed, 1 deletion(-)



[hudi] branch master updated (641abe8 -> 748dcc9)

2021-01-21 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 641abe8  [HUDI-1332] Introduce FlinkHoodieBloomIndex to 
hudi-flink-client (#2375)
 add 748dcc9  [MINOR] Remove InstantGeneratorOperator parallelism limit in 
HoodieFlinkStreamer and update docs (#2471)

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/HoodieFlinkStreamer.java  |  3 +--
 .../org/apache/hudi/operator/InstantGenerateOperator.java   | 13 ++---
 2 files changed, 7 insertions(+), 9 deletions(-)



[GitHub] [hudi] xushiyan closed pull request #2454: [HUDI-393] Set up CI with Azure Pipelines

2021-01-21 Thread GitBox


xushiyan closed pull request #2454:
URL: https://github.com/apache/hudi/pull/2454


   







[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=h1) Report
   > Merging [#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=desc) (7ababea) into [master](https://codecov.io/gh/apache/hudi/commit/a38612b10f6ae04644519270f9b5eb631a77c148?el=desc) (a38612b) will **decrease** coverage by `41.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master   #2374       +/-  ##
   ==========================================
   - Coverage     50.69%   9.68%   -41.01%
   + Complexity     3059      48     -3011
   ==========================================
     Files           419      53      -366
     Lines         18810    1930    -16880
     Branches       1924     230     -1694
   ==========================================
   - Hits           9535     187     -9348
   + Misses         8498    1730     -6768
   + Partials        777      13      -764
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.80%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==)
 | `83.62% <100.00%> (-5.18%)` | `28.00 <1.00> (ø)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` 

[jira] [Commented] (HUDI-1503) Implement a Hash(Bucket)-based Index

2021-01-21 Thread Mihir Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269828#comment-17269828
 ] 

Mihir Shah commented on HUDI-1503:
--

h4. [Shimin 
Yang|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=dangdangdang]

Hello Mr. Yang,

I would be interested in working on this issue. Is there some documentation 
about the index or the project's design that would help me understand the 
problem better?

Thank you!

> Implement a Hash(Bucket)-based Index
> ------------------------------------
>
> Key: HUDI-1503
> URL: https://issues.apache.org/jira/browse/HUDI-1503
> Project: Apache Hudi
>  Issue Type: Wish
>  Components: Index, Performance
>Reporter: Shimin Yang
>Priority: Major
>
> This ticket is to introduce a new hash-based index, which can improve the 
> performance of write operations and speed up queries at the same 
> time (removing shuffle for Spark/Hive).
> The new hash-based index works with a customized hash-based partitioner, 
> which partitions records based on the hash value of the index keys and a fixed 
> bucket number, so there is no need to visit the existing files to determine 
> which file group each record belongs to.
> Meanwhile, the file group id, hash mode and bucket number can be used by the 
> query engines to eliminate the shuffle introduced by aggregation and join.
> We implemented a HoodieIndex based on the Hive hash function, which is used in 
> ByteDance's production environment for many very-large-volume datasets, and 
> we hope this feature can be contributed to the community soon.
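
For a concrete picture, here is a minimal sketch of the bucket assignment the
description above implies, assuming a fixed bucket count; the hash function is
a simple stand-in for the Hive hash the proposal actually uses:

```java
import java.nio.charset.StandardCharsets;

public class BucketIndexSketch {

  // Deterministically map a record key to one of numBuckets buckets.
  // With a fixed bucket count, the bucket id alone identifies the file
  // group, so no file listing is needed to route a record.
  static int bucketId(String recordKey, int numBuckets) {
    int hash = 0;
    for (byte b : recordKey.getBytes(StandardCharsets.UTF_8)) {
      hash = 31 * hash + b;
    }
    return (hash & Integer.MAX_VALUE) % numBuckets;
  }

  public static void main(String[] args) {
    System.out.println(bucketId("uuid-0001", 256)); // stable bucket id
  }
}
```

The same bucket id is what a query engine could reuse to skip the shuffle for
aggregations and joins keyed on the index keys.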



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vrtrepp commented on issue #2461: All records are present in athena query result on glue crawled Hudi tables

2021-01-21 Thread GitBox


vrtrepp commented on issue #2461:
URL: https://github.com/apache/hudi/issues/2461#issuecomment-76438


   Hi Rubenssoto,
   That is how we are planning it, but it will involve writing a few more steps in 
the pipeline. However, our current architecture is based on running Glue crawlers, 
and removing the Glue crawlers will mean making changes in many pipelines, 
which is at least a month's task.
   
   What I was curious about is whether there is any support that Hudi is going to add 
in the future?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 merged pull request #2375: [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client

2021-01-21 Thread GitBox


garyli1019 merged pull request #2375:
URL: https://github.com/apache/hudi/pull/2375


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhedoubushishi commented on pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-21 Thread GitBox


zhedoubushishi commented on pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#issuecomment-764178360


   LGTM! This is also something I plan to do. Let's wait for others' review.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


yanghua commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r561716925



##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java
##
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some 
utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  // ------------------------------------------------------------------------
+  //  Base Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> PATH = ConfigOptions
+  .key("path")
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Base path for the target hoodie table."
+  + "\nThe path would be created if it does not exist,\n"
+  + "otherwise a Hoodie table expects to be initialized successfully");
+
+  public static final ConfigOption<String> PROPS_FILE_PATH = ConfigOptions
+  .key("properties-file.path")
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Path to properties file on local-fs or dfs, with 
configurations for \n"
+  + "hoodie client, schema provider, key generator and data source. 
For hoodie client props, sane defaults are\n"
+  + "used, but recommend use to provide basic things like metrics 
endpoints, hive configs etc. For sources, refer\n"
+  + "to individual classes, for supported properties");
+
+  // ------------------------------------------------------------------------
+  //  Read Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> READ_SCHEMA_FILE_PATH = ConfigOptions
+  .key("read.schema.file.path")
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Avro schema file path, the parsed schema is used for 
deserializing");
+
+  // ------------------------------------------------------------------------
+  //  Write Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> TABLE_NAME = ConfigOptions
+  .key(HoodieWriteConfig.TABLE_NAME)
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption<String> TABLE_TYPE = ConfigOptions

Review comment:
   So can we make `COPY_ON_WRITE` and `copy_on_write` equivalent?
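
   For illustration, one way to treat the two spellings as equivalent is to
normalize case before resolving the enum; a minimal sketch (not the PR's code),
assuming the value is parsed into Hudi's `HoodieTableType`:

   ```java
   import java.util.Locale;

   import org.apache.hudi.common.model.HoodieTableType;

   public class TableTypeParseSketch {
     // Normalize the user-supplied value so that "copy_on_write" and
     // "COPY_ON_WRITE" resolve to the same enum constant.
     static HoodieTableType parse(String value) {
       return HoodieTableType.valueOf(value.toUpperCase(Locale.ROOT));
     }

     public static void main(String[] args) {
       System.out.println(parse("copy_on_write")); // COPY_ON_WRITE
     }
   }
   ```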

##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java
##
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;

[GitHub] [hudi] rubenssoto closed issue #2464: [SUPPORT] ExecutorLostFailure - Try Processing 1TB Of Data

2021-01-21 Thread GitBox


rubenssoto closed issue #2464:
URL: https://github.com/apache/hudi/issues/2464


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#issuecomment-757929313


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=h1) Report
   > Merging 
[#2431](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=desc) (f1d0fda) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/a38612b10f6ae04644519270f9b5eb631a77c148?el=desc)
 (a38612b) will **decrease** coverage by `41.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2431/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=tree)
   
   ```diff
    @@              Coverage Diff              @@
    ##             master    #2431       +/-   ##
    =============================================
    - Coverage     50.69%    9.68%    -41.01%
    + Complexity     3059       48      -3011
    =============================================
      Files           419       53       -366
      Lines         18810     1930     -16880
      Branches       1924      230      -1694
    =============================================
    - Hits           9535      187      -9348
    + Misses         8498     1730      -6768
    + Partials        777       13       -764
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.80%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2431?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==)
 | `83.62% <100.00%> (-5.18%)` | `28.00 <1.00> (ø)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2431/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | 

[GitHub] [hudi] vinothchandar commented on issue #2470: [SUPPORT] Heavy skew in ListingBasedRollbackHelper

2021-01-21 Thread GitBox


vinothchandar commented on issue #2470:
URL: https://github.com/apache/hudi/issues/2470#issuecomment-764870104







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #2468: [MINOR] Disabling problematic tests temporarily to stabilize CI

2021-01-21 Thread GitBox


vinothchandar merged pull request #2468:
URL: https://github.com/apache/hudi/pull/2468


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2260: [HUDI-1381] Schedule compaction based on time elapsed

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2260:
URL: https://github.com/apache/hudi/pull/2260#issuecomment-729530724







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhedoubushishi commented on a change in pull request #2431: [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field

2021-01-21 Thread GitBox


zhedoubushishi commented on a change in pull request #2431:
URL: https://github.com/apache/hudi/pull/2431#discussion_r561334344



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -46,6 +46,11 @@ class DefaultSource extends RelationProvider
   with StreamSinkProvider
   with Serializable {
 
+  SparkSession.getActiveSession.foreach { spark =>
+    // Enable "passPartitionByAsOptions" to support "write.partitionBy(...)"
+    spark.conf.set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true")

Review comment:
   Could you also add a TODO comment here to indicate that we can remove 
this line after upgrading to Spark 3? I think in the future, Hudi will move to 
Spark 3.
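
   For context, a minimal usage sketch of what this enables, assuming the Spark
Java API and a hypothetical table name and base path; other required Hudi write
options are omitted for brevity:

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   public class PartitionBySketch {
     // With the legacy flag set, Spark forwards the partitionBy() columns to
     // the data source as an option, which Hudi can translate to
     // hoodie.datasource.write.partitionpath.field.
     static void write(SparkSession spark, Dataset<Row> df) {
       spark.conf().set("spark.sql.legacy.sources.write.passPartitionByAsOptions", "true");
       df.write()
           .format("hudi")
           .option("hoodie.table.name", "my_table") // hypothetical table name
           .partitionBy("partition")
           .mode("append")
           .save("/tmp/hudi/my_table");             // hypothetical base path
     }
   }
   ```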





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1389) [UMBRELLA] Survey indexing technique for better query performance

2021-01-21 Thread Mihir Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269794#comment-17269794
 ] 

Mihir Shah commented on HUDI-1389:
--

[Raymond 
Xu|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=xushiyan]

Hello, Mr. Xu,

This project seems very interesting and I would be interested in working on it. 
I am experienced with C/C++, Java, and Python (including its ML packages); I have 
worked with SQL, Bash, etc., as well as some functional languages like Haskell; and 
I have some experience with data mining. Could you please let me know how I could start on 
this project?

Thank you!

> [UMBRELLA] Survey indexing technique for better query performance
> -----------------------------------------------------------------
>
> Key: HUDI-1389
> URL: https://issues.apache.org/jira/browse/HUDI-1389
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Raymond Xu
>Priority: Major
>  Labels: gsoc, gsoc2021, mentor
>
> (More details to be added)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2447: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2447:
URL: https://github.com/apache/hudi/pull/2447#issuecomment-760949326







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2471: [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFli…

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2471:
URL: https://github.com/apache/hudi/pull/2471#issuecomment-765116001


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=h1) Report
   > Merging 
[#2471](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=desc) (ef62c24) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/976420c49a3fe764f7cecd30b6cdb32861be5537?el=desc)
 (976420c) will **increase** coverage by `19.20%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2471/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=tree)
   
   ```diff
    @@              Coverage Diff               @@
    ##             master    #2471        +/-   ##
    ==============================================
    + Coverage     50.27%   69.48%     +19.20%
    + Complexity     3050      358       -2692
    ==============================================
      Files           419       53        -366
      Lines         18897     1930      -16967
      Branches       1937      230       -1707
    ==============================================
    - Hits           9500     1341       -8159
    + Misses         8622      456       -8166
    + Partials        775      133        -642
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.48% <ø> (+0.05%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...common/table/view/FileSystemViewStorageConfig.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvRmlsZVN5c3RlbVZpZXdTdG9yYWdlQ29uZmlnLmphdmE=)
 | | | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | | | |
   | 
[...udi/common/table/timeline/dto/FSPermissionDTO.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9GU1Blcm1pc3Npb25EVE8uamF2YQ==)
 | | | |
   | 
[...rg/apache/hudi/common/util/SerializationUtils.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvU2VyaWFsaXphdGlvblV0aWxzLmphdmE=)
 | | | |
   | 
[...adoop/realtime/HoodieHFileRealtimeInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL0hvb2RpZUhGaWxlUmVhbHRpbWVJbnB1dEZvcm1hdC5qYXZh)
 | | | |
   | 
[...hudi/hadoop/hive/HoodieCombineHiveInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL2hpdmUvSG9vZGllQ29tYmluZUhpdmVJbnB1dEZvcm1hdC5qYXZh)
 | | | |
   | 
[...di/timeline/service/handlers/FileSliceHandler.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvaGFuZGxlcnMvRmlsZVNsaWNlSGFuZGxlci5qYXZh)
 | | | |
   | 
[...ain/scala/org/apache/hudi/cli/DedupeSparkJob.scala](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL2NsaS9EZWR1cGVTcGFya0pvYi5zY2FsYQ==)
 | | | |
   | 
[...rg/apache/hudi/metadata/MetadataPartitionType.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvTWV0YWRhdGFQYXJ0aXRpb25UeXBlLmphdmE=)
 | | | |
   | 
[...sioning/clean/CleanMetadataV1MigrationHandler.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5NZXRhZGF0YVYxTWlncmF0aW9uSGFuZGxlci5qYXZh)
 | | | |
   | ... and [355 
more](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact 

[GitHub] [hudi] codecov-io edited a comment on pull request #2325: [HUDI-699]Fix CompactionCommand and add unit test for CompactionCommand

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2325:
URL: https://github.com/apache/hudi/pull/2325#issuecomment-742860619


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2325?src=pr=h1) Report
   > Merging 
[#2325](https://codecov.io/gh/apache/hudi/pull/2325?src=pr=desc) (333436e) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/749f6578561cbf065c7f74ab51b1c01881a1bd97?el=desc)
 (749f657) will **increase** coverage by `18.71%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2325/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2325?src=pr=tree)
   
   ```diff
    @@              Coverage Diff               @@
    ##             master    #2325        +/-   ##
    ==============================================
    + Coverage     50.71%   69.43%     +18.71%
    + Complexity     3060      357       -2703
    ==============================================
      Files           419       53        -366
      Lines         18796     1930      -16866
      Branches       1922      230       -1692
    ==============================================
    - Hits           9533     1340       -8193
    + Misses         8488      456       -8032
    + Partials        775      134        -641
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2325?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.50% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | |
   | 
[...di/common/table/log/block/HoodieAvroDataBlock.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVBdnJvRGF0YUJsb2NrLmphdmE=)
 | | | |
   | 
[...hadoop/LocatedFileStatusWithBootstrapBaseFile.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0xvY2F0ZWRGaWxlU3RhdHVzV2l0aEJvb3RzdHJhcEJhc2VGaWxlLmphdmE=)
 | | | |
   | 
[...rg/apache/hudi/cli/commands/SavepointsCommand.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NhdmVwb2ludHNDb21tYW5kLmphdmE=)
 | | | |
   | 
[...til/jvm/HotSpotMemoryLayoutSpecification64bit.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvanZtL0hvdFNwb3RNZW1vcnlMYXlvdXRTcGVjaWZpY2F0aW9uNjRiaXQuamF2YQ==)
 | | | |
   | 
[...rg/apache/hudi/metadata/MetadataPartitionType.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvTWV0YWRhdGFQYXJ0aXRpb25UeXBlLmphdmE=)
 | | | |
   | 
[...in/java/org/apache/hudi/hive/HoodieHiveClient.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSG9vZGllSGl2ZUNsaWVudC5qYXZh)
 | | | |
   | 
[...e/hudi/metadata/FileSystemBackedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvRmlsZVN5c3RlbUJhY2tlZFRhYmxlTWV0YWRhdGEuamF2YQ==)
 | | | |
   | 
[...hudi/common/model/HoodieReplaceCommitMetadata.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlcGxhY2VDb21taXRNZXRhZGF0YS5qYXZh)
 | | | |
   | 
[...che/hudi/common/table/timeline/dto/InstantDTO.java](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9JbnN0YW50RFRPLmphdmE=)
 | | | |
   | ... and [352 
more](https://codecov.io/gh/apache/hudi/pull/2325/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this 

[GitHub] [hudi] codecov-io edited a comment on pull request #2447: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2447:
URL: https://github.com/apache/hudi/pull/2447#issuecomment-760949326


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2447?src=pr=h1) Report
   > Merging 
[#2447](https://codecov.io/gh/apache/hudi/pull/2447?src=pr=desc) (b5aad46) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/a38612b10f6ae04644519270f9b5eb631a77c148?el=desc)
 (a38612b) will **increase** coverage by `18.73%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2447/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2447?src=pr=tree)
   
   ```diff
    @@              Coverage Diff               @@
    ##             master    #2447        +/-   ##
    ==============================================
    + Coverage     50.69%   69.43%     +18.73%
    + Complexity     3059      357       -2702
    ==============================================
      Files           419       53        -366
      Lines         18810     1930      -16880
      Branches       1924      230       -1694
    ==============================================
    - Hits           9535     1340       -8195
    + Misses         8498      456       -8042
    + Partials        777      134        -643
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <100.00%> (-0.06%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2447?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==)
 | `88.79% <100.00%> (ø)` | `28.00 <1.00> (ø)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.50% <0.00%> (-0.36%)` | `50.00% <0.00%> (-1.00%)` | |
   | 
[...a/org/apache/hudi/common/util/Base64CodecUtil.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQmFzZTY0Q29kZWNVdGlsLmphdmE=)
 | | | |
   | 
[...rg/apache/hudi/metadata/MetadataPartitionType.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvTWV0YWRhdGFQYXJ0aXRpb25UeXBlLmphdmE=)
 | | | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | | | |
   | 
[...ommon/util/queue/BoundedInMemoryQueueConsumer.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvcXVldWUvQm91bmRlZEluTWVtb3J5UXVldWVDb25zdW1lci5qYXZh)
 | | | |
   | 
[...java/org/apache/hudi/common/util/NumericUtils.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvTnVtZXJpY1V0aWxzLmphdmE=)
 | | | |
   | 
[.../common/table/timeline/HoodieArchivedTimeline.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZUFyY2hpdmVkVGltZWxpbmUuamF2YQ==)
 | | | |
   | 
[.../apache/hudi/exception/TableNotFoundException.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL1RhYmxlTm90Rm91bmRFeGNlcHRpb24uamF2YQ==)
 | | | |
   | 
[...udi/common/util/queue/BoundedInMemoryExecutor.java](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvcXVldWUvQm91bmRlZEluTWVtb3J5RXhlY3V0b3IuamF2YQ==)
 | | | |
   | ... and [336 
more](https://codecov.io/gh/apache/hudi/pull/2447/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For 

[GitHub] [hudi] codecov-io commented on pull request #2471: [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFli…

2021-01-21 Thread GitBox


codecov-io commented on pull request #2471:
URL: https://github.com/apache/hudi/pull/2471#issuecomment-765116001


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=h1) Report
   > Merging 
[#2471](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=desc) (ef62c24) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/976420c49a3fe764f7cecd30b6cdb32861be5537?el=desc)
 (976420c) will **decrease** coverage by `40.58%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2471/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=tree)
   
   ```diff
    @@              Coverage Diff              @@
    ##             master    #2471       +/-   ##
    =============================================
    - Coverage     50.27%    9.68%    -40.59%
    + Complexity     3050       48      -3002
    =============================================
      Files           419       53       -366
      Lines         18897     1930     -16967
      Branches       1937      230      -1707
    =============================================
    - Hits           9500      187      -9313
    + Misses         8622     1730      -6892
    + Partials        775       13       -762
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <ø> (-59.75%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2471?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2471/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r562375198



##
File path: hudi-flink/src/test/resources/test_source.data
##
@@ -0,0 +1,8 @@
+{"uuid": "id1", "name": "Danny", "age": 23, "ts": "1970-01-01T00:00:01", 
"partition": "par1"}

Review comment:
   No, I'm old





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r562371379



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteFunction.java
##
@@ -0,0 +1,342 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.client.FlinkTaskContextSupplier;
+import org.apache.hudi.client.HoodieFlinkWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.util.ObjectSizeCalculator;
+import org.apache.hudi.keygen.KeyGenerator;
+import org.apache.hudi.operator.event.BatchWriteSuccessEvent;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.flink.annotation.VisibleForTesting;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.formats.avro.RowDataToAvroConverters;
+import org.apache.flink.runtime.operators.coordination.OperatorEventGateway;
+import org.apache.flink.runtime.state.FunctionInitializationContext;
+import org.apache.flink.runtime.state.FunctionSnapshotContext;
+import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
+import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.util.Collector;
+import org.apache.flink.util.Preconditions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.locks.Condition;
+import java.util.concurrent.locks.ReentrantLock;
+import java.util.function.BiFunction;
+
+/**
+ * Sink function to write the data to the underlying filesystem.
+ *
+ * Work Flow
+ *
+ * The function first buffers the data as a batch of {@link HoodieRecord}s,
+ * and flushes (writes) the record batch when a Flink checkpoint starts. After a batch has been written successfully,
+ * the function notifies its operator coordinator {@link StreamWriteOperatorCoordinator} to mark a successful write.
+ *
+ * Exactly-once Semantics
+ *
+ * The task implements exactly-once semantics by buffering the data between checkpoints. The operator coordinator
+ * starts a new instant on the timeline when a checkpoint triggers; the coordinator checkpoints always
+ * start before their operator, so when this function starts a checkpoint, a REQUESTED instant already exists.
+ * The function's processing thread then blocks data buffering and the checkpoint thread starts flushing the existing data buffer.
+ * When the existing data buffer is written successfully, the processing thread unblocks and starts buffering again for the next checkpoint round.
+ * Because any checkpoint failure triggers a write rollback, this implements exactly-once semantics.
+ *
+ * Fault Tolerance
+ *
+ * The operator coordinator checks the validity of the last instant when it starts a new one. The operator rolls back
+ * the written data and throws an exception when any error occurs, so any checkpoint or task failure triggers a failover.
+ * The operator coordinator retries several times when committing the write status.
+ *
+ * Note: The function task requires the input stream to be partitioned by the partition fields, so that different write
+ * tasks do not write to the same file group and conflict. The common case for the partition path is a datetime field,
+ * so the sink task is very likely to have an IO bottleneck; a more flexible solution is to shuffle the
+ * data by the file group IDs.
+ *
+ * @param <I> Type of the input record
+ * @see StreamWriteOperatorCoordinator
+ */
+public class StreamWriteFunction<K, I, O>
+    extends KeyedProcessFunction<K, I, O>
+    implements CheckpointedFunction {
+
+  
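
   As a condensed illustration of the buffering protocol the Javadoc above
describes (simplified types, not the PR's implementation):

   ```java
   import java.util.ArrayList;
   import java.util.List;

   // Buffer records between checkpoints; flush the buffer when a checkpoint
   // starts, then notify the coordinator so the batch is marked as written.
   class BufferingSinkSketch<I> {
     private final List<I> buffer = new ArrayList<>();

     void processElement(I record) {
       buffer.add(record); // buffering phase, between two checkpoints
     }

     void snapshotState(String instantTime) {
       List<I> batch = new ArrayList<>(buffer);
       buffer.clear();
       write(batch, instantTime);      // flush on checkpoint
       notifyCoordinator(instantTime); // report the successful batch
     }

     private void write(List<I> batch, String instant) { /* write to storage */ }
     private void notifyCoordinator(String instant) { /* send success event */ }
   }
   ```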

[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r562371014



##
File path: hudi-flink/src/main/java/org/apache/hudi/operator/FlinkOptions.java
##
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Hoodie Flink config options.
+ *
+ * It has the options for Hoodie table read and write. It also defines some 
utilities.
+ */
+public class FlinkOptions {
+  private FlinkOptions() {
+  }
+
+  // ------------------------------------------------------------------------
+  //  Base Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> PATH = ConfigOptions
+  .key("path")
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Base path for the target hoodie table."
+  + "\nThe path would be created if it does not exist,\n"
+  + "otherwise a Hoodie table expects to be initialized successfully");
+
+  public static final ConfigOption<String> PROPS_FILE_PATH = ConfigOptions
+  .key("properties-file.path")
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Path to properties file on local-fs or dfs, with 
configurations for \n"
+  + "hoodie client, schema provider, key generator and data source. 
For hoodie client props, sane defaults are\n"
+  + "used, but recommend use to provide basic things like metrics 
endpoints, hive configs etc. For sources, refer\n"
+  + "to individual classes, for supported properties");
+
+  // ------------------------------------------------------------------------
+  //  Read Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> READ_SCHEMA_FILE_PATH = ConfigOptions
+  .key("read.schema.file.path")
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Avro schema file path, the parsed schema is used for 
deserializing");
+
+  // ------------------------------------------------------------------------
+  //  Write Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<String> TABLE_NAME = ConfigOptions
+  .key(HoodieWriteConfig.TABLE_NAME)
+  .stringType()
+  .noDefaultValue()
+  .withDescription("Table name to register to Hive metastore");
+
+  public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
+  .key("write.table.type")
+  .stringType()
+  .defaultValue("COPY_ON_WRITE")
+  .withDescription("Type of table to write. COPY_ON_WRITE (or) 
MERGE_ON_READ");
+
+  public static final ConfigOption<String> OPERATION = ConfigOptions
+  .key("write.operation")
+  .stringType()
+  .defaultValue("upsert")
+  .withDescription("The write operation, that this write should do");
+
+  public static final ConfigOption<String> PRECOMBINE_FIELD = ConfigOptions
+  .key("write.precombine.field")

Review comment:
   We can, but I would suggest doing this in a separate issue; we need to 
define some rules for config option names.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r562367018



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamerV2.java
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.streamer;
+
+import org.apache.hudi.operator.FlinkOptions;
+import org.apache.hudi.operator.StreamWriteOperatorFactory;
+import org.apache.hudi.util.StreamerUtil;
+
+import com.beust.jcommander.JCommander;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
+import org.apache.flink.formats.json.JsonRowDataDeserializationSchema;
+import org.apache.flink.formats.json.TimestampFormat;
+import org.apache.flink.runtime.state.filesystem.FsStateBackend;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.runtime.typeutils.InternalTypeInfo;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.flink.table.types.logical.RowType;
+
+import java.util.Properties;
+
+/**
+ * A utility which can incrementally consume data from Kafka and apply it to the target table.
+ * Currently, it only supports COW tables and the insert and upsert operations.
+ */
+public class HoodieFlinkStreamerV2 {
+  public static void main(String[] args) throws Exception {
+StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
+
+final FlinkStreamerConfig cfg = new FlinkStreamerConfig();
+JCommander cmd = new JCommander(cfg, null, args);
+if (cfg.help || args.length == 0) {
+  cmd.usage();
+  System.exit(1);
+}
+env.enableCheckpointing(cfg.checkpointInterval);
+env.getConfig().setGlobalJobParameters(cfg);
+// We use checkpoint to trigger write operation, including instant 
generating and committing,
+// There can only be one checkpoint at one time.
+env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
+env.disableOperatorChaining();

Review comment:
   Not necessary, removed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r562366909



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.streamer;
+
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+
+import com.beust.jcommander.Parameter;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Configurations for Hoodie Flink streamer.
+ */
+public class FlinkStreamerConfig extends Configuration {
+  @Parameter(names = {"--kafka-topic"}, description = "Kafka topic name.", 
required = true)
+  public String kafkaTopic;
+
+  @Parameter(names = {"--kafka-group-id"}, description = "Kafka consumer group 
id.", required = true)
+  public String kafkaGroupId;
+
+  @Parameter(names = {"--kafka-bootstrap-servers"}, description = "Kafka 
bootstrap.servers.", required = true)
+  public String kafkaBootstrapServers;
+
+  @Parameter(names = {"--flink-checkpoint-path"}, description = "Flink 
checkpoint path.")
+  public String flinkCheckPointPath;
+
+  @Parameter(names = {"--flink-block-retry-times"}, description = "Times to 
retry when latest instant has not completed.")

Review comment:
   Copied from the old code; how about `instant-retry-times`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r562366032



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/event/BatchWriteSuccessEvent.java
##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator.event;
+
+import org.apache.flink.runtime.operators.coordination.OperatorEvent;
+
+import org.apache.hudi.client.WriteStatus;
+
+import java.util.List;
+
+/**
+ * An operator event to mark a successful checkpoint batch write.
+ */
+public class BatchWriteSuccessEvent implements OperatorEvent {
+  private static final long serialVersionUID = 1L;
+
+  private final List writeStatuses;
+  private final int taskID;
+  private final String instantTime;
+
+  /**
+   * Creates an event.
+   *
+   * @param taskIDThe task ID
+   * @param instantTime   The instant time under which to write the data
+   * @param writeStatuses The write statuses list
+   */
+  public BatchWriteSuccessEvent(

Review comment:
   Removed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


danny0405 commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r562365771



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/StreamWriteFunction.java
##
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.operator;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.client.FlinkTaskContextSupplier;
+import org.apache.hudi.client.HoodieFlinkWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.util.ObjectSizeCalculator;
+import org.apache.hudi.keygen.KeyGenerator;
+import org.apache.hudi.operator.event.BatchWriteSuccessEvent;
+import org.apache.hudi.util.StreamerUtil;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.flink.annotation.VisibleForTesting;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.formats.avro.RowDataToAvroConverters;
+import org.apache.flink.runtime.operators.coordination.OperatorEventGateway;
+import org.apache.flink.runtime.state.FunctionInitializationContext;
+import org.apache.flink.runtime.state.FunctionSnapshotContext;
+import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
+import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.util.Collector;
+import org.apache.flink.util.Preconditions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.Random;
+import java.util.concurrent.locks.Condition;
+import java.util.concurrent.locks.ReentrantLock;
+import java.util.function.BiFunction;
+
+/**
+ * Sink function that writes the data to the underlying filesystem.
+ *
+ * Work Flow
+ *
+ * The function first buffers the data as a batch of {@link HoodieRecord}s,
+ * and flushes (writes) the batch when a Flink checkpoint starts. After a batch has been written successfully,
+ * the function notifies its operator coordinator {@link StreamWriteOperatorCoordinator} to mark a successful write.
+ *
+ * Exactly-once Semantics
+ *
+ * The task implements exactly-once semantics by buffering the data between checkpoints. The operator coordinator
+ * starts a new instant on the timeline when a checkpoint triggers; the coordinator checkpoints always
+ * start before its operator, so when this function starts a checkpoint, a REQUESTED instant already exists.
+ * The function's process thread then blocks data buffering while the checkpoint thread flushes the existing data buffer.
+ * Once the existing data buffer is written successfully, the process thread unblocks and starts buffering again for the next checkpoint round.
+ * Because any checkpoint failure triggers a write rollback, this yields exactly-once semantics.
+ *
+ * Fault Tolerance
+ *
+ * The operator coordinator checks the validity of the last instant when it starts a new one. The operator rolls back
+ * the written data and throws when any error occurs, so any checkpoint or task failure triggers a failover.
+ * The operator coordinator retries several times when committing the write status.
+ *
+ * Note: The function task requires the input stream to be partitioned by the partition fields so that different write
+ * tasks do not write to the same file group and conflict. The common case for the partition path is a datetime field,
+ * so the sink task is very likely to have an IO bottleneck; a more flexible solution is to shuffle the
+ * data by file group IDs.
+ *
+ * @param <I> Type of the input record
+ * @see StreamWriteOperatorCoordinator
+ */
+public class StreamWriteFunction extends KeyedProcessFunction implements CheckpointedFunction {
+
+  private static 

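   The Javadoc above boils down to a buffer-on-process, flush-on-checkpoint loop. A minimal,
   framework-agnostic sketch of that pattern (all names are illustrative, not the PR's API):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Hedged sketch of the buffer-then-flush-on-checkpoint workflow described above.
   // Names are illustrative; this is not the PR's actual API.
   class BufferingSink<I> {
     private final List<I> buffer = new ArrayList<>();
   
     void processElement(I record) {
       buffer.add(record);             // buffer records between checkpoints
     }
   
     void snapshotState() {
       flush(new ArrayList<>(buffer)); // flush the batch when a checkpoint starts
       buffer.clear();                 // start buffering for the next round
     }
   
     void flush(List<I> batch) {
       // write the batch and notify the coordinator on success (omitted)
     }
   }
   ```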
[GitHub] [hudi] codecov-io edited a comment on pull request #2426: [HUDI-304] Configure spotless and java style

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2426:
URL: https://github.com/apache/hudi/pull/2426#issuecomment-757443274


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2426?src=pr=h1) Report
   > Merging 
[#2426](https://codecov.io/gh/apache/hudi/pull/2426?src=pr=desc) (9cb1268) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/e926c1a45ca95fa1911f6f88a0577554f2797760?el=desc)
 (e926c1a) will **decrease** coverage by `0.20%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2426/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2426?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2426  +/-   ##
   
   - Coverage 50.73%   50.52%   -0.21% 
   + Complexity 3064 3032  -32 
   
 Files   419  417   -2 
  Lines 18797    18725      -72 
 Branches   1922 1917   -5 
   
   - Hits   9536 9461  -75 
   - Misses 8485 8489   +4 
    + Partials   776      775       -1 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.28% <ø> (+0.01%)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.61% <ø> (-0.41%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `10.20% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.00% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `66.07% <ø> (+0.16%)` | `0.00 <ø> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.84% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.41% <ø> (-0.02%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2426?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/common/engine/TaskContextSupplier.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9UYXNrQ29udGV4dFN1cHBsaWVyLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...e/hudi/metadata/FileSystemBackedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvRmlsZVN5c3RlbUJhY2tlZFRhYmxlTWV0YWRhdGEuamF2YQ==)
 | `0.00% <0.00%> (-94.60%)` | `0.00% <0.00%> (-13.00%)` | |
   | 
[...apache/hudi/common/engine/HoodieEngineContext.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9Ib29kaWVFbmdpbmVDb250ZXh0LmphdmE=)
 | `0.00% <0.00%> (-66.67%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...g/apache/hudi/common/function/FunctionWrapper.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Z1bmN0aW9uL0Z1bmN0aW9uV3JhcHBlci5qYXZh)
 | `0.00% <0.00%> (-10.53%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `15.00% <0.00%> (-1.00%)` | |
   | 
[...c/main/java/org/apache/hudi/common/fs/FSUtils.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0ZTVXRpbHMuamF2YQ==)
 | `45.71% <0.00%> (-6.18%)` | `55.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh)
 | `68.96% <0.00%> (-1.73%)` | `43.00% <0.00%> (-2.00%)` | |
   | 
[...i/common/table/timeline/TimelineMetadataUtils.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL1RpbWVsaW5lTWV0YWRhdGFVdGlscy5qYXZh)
 | `71.69% <0.00%> (-1.03%)` | `17.00% <0.00%> (ø%)` | |
   | 
[...rg/apache/hudi/hadoop/HoodieROTablePathFilter.java](https://codecov.io/gh/apache/hudi/pull/2426/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0hvb2RpZVJPVGFibGVQYXRoRmlsdGVyLmphdmE=)
 | `63.15% <0.00%> (-0.48%)` | `12.00% <0.00%> (ø%)` | |
   | 

[jira] [Closed] (HUDI-1332) Introduce FlinkHoodieBloomIndex to hudi-flink-client

2021-01-21 Thread Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Li closed HUDI-1332.
-

> Introduce FlinkHoodieBloomIndex to hudi-flink-client
> 
>
> Key: HUDI-1332
> URL: https://issues.apache.org/jira/browse/HUDI-1332
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: Xiang Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> a flink implementation for bloom index



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1332) Introduce FlinkHoodieBloomIndex to hudi-flink-client

2021-01-21 Thread Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Li resolved HUDI-1332.
---
Resolution: Fixed

> Introduce FlinkHoodieBloomIndex to hudi-flink-client
> 
>
> Key: HUDI-1332
> URL: https://issues.apache.org/jira/browse/HUDI-1332
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: Xiang Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> a flink implementation for bloom index



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated (b64d22e -> 641abe8)

2021-01-21 Thread garyli
This is an automated email from the ASF dual-hosted git repository.

garyli pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from b64d22e  [HUDI-1511] InstantGenerateOperator support multiple 
parallelism (#2434)
 add 641abe8  [HUDI-1332] Introduce FlinkHoodieBloomIndex to 
hudi-flink-client (#2375)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/index/FlinkHoodieIndex.java|   3 +
 .../hudi/index/bloom/FlinkHoodieBloomIndex.java| 267 +
 .../bloom/HoodieFlinkBloomIndexCheckFunction.java} |  31 ++-
 .../src/main}/resources/log4j-surefire.properties  |   0
 .../index/bloom/TestFlinkHoodieBloomIndex.java}| 203 
 .../testutils/HoodieFlinkClientTestHarness.java|  94 
 .../testutils/HoodieFlinkWriteableTestTable.java   | 136 +++
 7 files changed, 626 insertions(+), 108 deletions(-)
 create mode 100644 
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/bloom/FlinkHoodieBloomIndex.java
 copy 
hudi-client/{hudi-spark-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexCheckFunction.java
 => 
hudi-flink-client/src/main/java/org/apache/hudi/index/bloom/HoodieFlinkBloomIndexCheckFunction.java}
 (77%)
 copy {hudi-spark-datasource/hudi-spark/src/test => 
hudi-client/hudi-flink-client/src/main}/resources/log4j-surefire.properties 
(100%)
 copy 
hudi-client/{hudi-spark-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java
 => 
hudi-flink-client/src/test/java/org/apache/hudi/index/bloom/TestFlinkHoodieBloomIndex.java}
 (72%)
 create mode 100644 
hudi-client/hudi-flink-client/src/test/java/org/apache/hudi/testutils/HoodieFlinkWriteableTestTable.java



[GitHub] [hudi] garyli1019 merged pull request #2375: [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client

2021-01-21 Thread GitBox


garyli1019 merged pull request #2375:
URL: https://github.com/apache/hudi/pull/2375


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #2471: [MINOR]Remove InstantGeneratorOperator parallelism limit in HoodieFli…

2021-01-21 Thread GitBox


wangxianghu commented on pull request #2471:
URL: https://github.com/apache/hudi/pull/2471#issuecomment-765067817


   @yanghua @loukey-lj please take a look when free



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu opened a new pull request #2471: [MINOR]Remove InstantGeneratorOperator parallelism limit in HoodieFli…

2021-01-21 Thread GitBox


wangxianghu opened a new pull request #2471:
URL: https://github.com/apache/hudi/pull/2471


   …nkStreamer and update docs
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #2375: [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client

2021-01-21 Thread GitBox


wangxianghu commented on a change in pull request #2375:
URL: https://github.com/apache/hudi/pull/2375#discussion_r562308054



##
File path: 
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/index/bloom/HoodieFlinkBloomIndexCheckFunction.java
##
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.client.utils.LazyIterableIterator;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.io.HoodieKeyLookupHandle;
+import org.apache.hudi.io.HoodieKeyLookupHandle.KeyLookupResult;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.function.Function;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+import scala.Tuple2;
+
+/**
+ * Function performing actual checking of List partition containing (fileId, 
hoodieKeys) against the actual files.
+ */
+//TODO we can move this class into the hudi-client-common and reuse it for 
spark client
+public class HoodieFlinkBloomIndexCheckFunction
+implements Function>, 
Iterator>> {
+
+  private final HoodieTable hoodieTable;
+
+  private final HoodieWriteConfig config;
+
+  public HoodieFlinkBloomIndexCheckFunction(HoodieTable hoodieTable, 
HoodieWriteConfig config) {
+this.hoodieTable = hoodieTable;
+this.config = config;
+  }
+
+  @Override
+  public Iterator> apply(Iterator> fileParitionRecordKeyTripletItr) {
+return new LazyKeyCheckIterator(fileParitionRecordKeyTripletItr);
+  }
+
+  @Override
+  public  Function>> compose(Function>> before) {
+return null;
+  }
+
+  @Override
+  public  Function>, V> 
andThen(Function>, ? extends V> after) {
+return null;
+  }
+
+  class LazyKeyCheckIterator extends LazyIterableIterator, List> {

Review comment:
   > maybe we can move HoodieFlinkBloomIndexCheckFunction into the 
hudi-client-common later then spark can reuse it.
   
   yes, could be another PR





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-1511) InstantGenerateOperator support multiple parallelism

2021-01-21 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-1511.
--
Resolution: Implemented

Fixed via master branch: b64d22e0478b967c83be22509beaaf8c5f114e19

> InstantGenerateOperator support multiple parallelism
> 
>
> Key: HUDI-1511
> URL: https://issues.apache.org/jira/browse/HUDI-1511
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: loukey_j
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated: [HUDI-1511] InstantGenerateOperator support multiple parallelism (#2434)

2021-01-21 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b64d22e  [HUDI-1511] InstantGenerateOperator support multiple 
parallelism (#2434)
b64d22e is described below

commit b64d22e0478b967c83be22509beaaf8c5f114e19
Author: luokey <854194...@qq.com>
AuthorDate: Fri Jan 22 09:17:50 2021 +0800

[HUDI-1511] InstantGenerateOperator support multiple parallelism (#2434)
---
 .../hudi/operator/InstantGenerateOperator.java | 170 +++--
 1 file changed, 123 insertions(+), 47 deletions(-)

diff --git 
a/hudi-flink/src/main/java/org/apache/hudi/operator/InstantGenerateOperator.java
 
b/hudi-flink/src/main/java/org/apache/hudi/operator/InstantGenerateOperator.java
index 4e32ec7..7879243 100644
--- 
a/hudi-flink/src/main/java/org/apache/hudi/operator/InstantGenerateOperator.java
+++ 
b/hudi-flink/src/main/java/org/apache/hudi/operator/InstantGenerateOperator.java
@@ -38,10 +38,13 @@ import 
org.apache.flink.runtime.state.StateInitializationContext;
 import org.apache.flink.runtime.state.StateSnapshotContext;
 import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
 import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
+import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;
 import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.PathFilter;
 
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -49,9 +52,9 @@ import org.slf4j.LoggerFactory;
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Iterator;
-import java.util.LinkedList;
 import java.util.List;
 import java.util.concurrent.TimeUnit;
+import java.util.concurrent.atomic.AtomicLong;
 
 /**
  * Operator helps to generate globally unique instant, it must be executed in 
one parallelism. Before generate a new
@@ -71,16 +74,20 @@ public class InstantGenerateOperator extends 
AbstractStreamOperator latestInstantList = new ArrayList<>(1);
   private transient ListState latestInstantState;
-  private List bufferedRecords = new LinkedList();
-  private transient ListState recordsState;
   private Integer retryTimes;
   private Integer retryInterval;
+  private static final String DELIMITER = "_";
+  private static final String INSTANT_MARKER_FOLDER_NAME = ".instant_marker";
+  private transient boolean isMain = false;
+  private transient AtomicLong recordCounter = new AtomicLong(0);
+  private StreamingRuntimeContext runtimeContext;
+  private int indexOfThisSubtask;
 
   @Override
   public void processElement(StreamRecord streamRecord) throws 
Exception {
 if (streamRecord.getValue() != null) {
-  bufferedRecords.add(streamRecord);
   output.collect(streamRecord);
+  recordCounter.incrementAndGet();
 }
   }
 
@@ -88,7 +95,7 @@ public class InstantGenerateOperator extends 
AbstractStreamOperator latestInstantStateDescriptor = new 
ListStateDescriptor("latestInstant", String.class);
-latestInstantState = 
context.getOperatorStateStore().getListState(latestInstantStateDescriptor);
-
-// recordState
-ListStateDescriptor recordsStateDescriptor = new 
ListStateDescriptor("recordsState", StreamRecord.class);
-recordsState = 
context.getOperatorStateStore().getListState(recordsStateDescriptor);
-
-if (context.isRestored()) {
-  Iterator latestInstantIterator = 
latestInstantState.get().iterator();
-  latestInstantIterator.forEachRemaining(x -> latestInstant = x);
-  LOG.info("InstantGenerateOperator initializeState get latestInstant 
[{}]", latestInstant);
-
-  Iterator recordIterator = recordsState.get().iterator();
-  bufferedRecords.clear();
-  recordIterator.forEachRemaining(x -> bufferedRecords.add(x));
+runtimeContext = getRuntimeContext();
+indexOfThisSubtask = runtimeContext.getIndexOfThisSubtask();
+isMain = indexOfThisSubtask == 0;
+
+if (isMain) {
+  // instantState
+  ListStateDescriptor latestInstantStateDescriptor = new 
ListStateDescriptor<>("latestInstant", String.class);
+  latestInstantState = 
context.getOperatorStateStore().getListState(latestInstantStateDescriptor);
+
+  if (context.isRestored()) {
+Iterator latestInstantIterator = 
latestInstantState.get().iterator();
+latestInstantIterator.forEachRemaining(x -> latestInstant = x);
+LOG.info("Restoring the latest instant [{}] from the state", 
latestInstant);
+  }
 }
   }
 
   @Override
   public void snapshotState(StateSnapshotContext functionSnapshotContext) 
throws Exception {
-if (latestInstantList.isEmpty()) {
-  latestInstantList.add(latestInstant);
+long 

[GitHub] [hudi] yanghua merged pull request #2434: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-21 Thread GitBox


yanghua merged pull request #2434:
URL: https://github.com/apache/hudi/pull/2434


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #2434: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-21 Thread GitBox


yanghua commented on pull request #2434:
URL: https://github.com/apache/hudi/pull/2434#issuecomment-765047901


   > @yanghua should be good to land now if you are happy with it
   
   ack, thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #2447: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-01-21 Thread GitBox


leesf commented on pull request #2447:
URL: https://github.com/apache/hudi/pull/2447#issuecomment-765010589


   > @leesf Hello, I have a doubt now. I did not modify the code of 
`hudi-integ-test`, but every time the check fails because of it, do you know why?
   
   would you please rebase onto master, since there is a flaky test fix on master.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #2470: [SUPPORT] Heavy skew in ListingBasedRollbackHelper

2021-01-21 Thread GitBox


vinothchandar commented on issue #2470:
URL: https://github.com/apache/hudi/issues/2470#issuecomment-764870104


   @jtmzheng can you just give the marker based rollbacks a shot? We intend to 
make it the default in the next release. If the issue is from listing, then it 
would help out a lot. 
   `hoodie.rollback.using.markers=true` 
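   For reference, a minimal sketch of adding this to the writer options from the issue above
   (only the `hoodie.rollback.using.markers` key is the point; the surrounding calls mirror a
   generic Spark datasource write, and the table name and base path are assumptions):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class MarkerRollbackSketch {
     // Hedged sketch: switch rollbacks from listing-based to marker-based.
     static void writeHudi(Dataset<Row> df, String basePath) {
       df.write().format("hudi")
           .option("hoodie.table.name", "transactions")     // table name from the issue above
           .option("hoodie.rollback.using.markers", "true") // the option suggested here
           .mode(SaveMode.Append)
           .save(basePath);                                 // basePath is an assumption
     }
   }
   ```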



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #2470: [SUPPORT] Heavy skew in ListingBasedRollbackHelper

2021-01-21 Thread GitBox


vinothchandar commented on issue #2470:
URL: https://github.com/apache/hudi/issues/2470#issuecomment-764870271


   we parallelize by partitions, so it should be fast; not sure where the skew comes from.
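   A related knob while investigating is the rollback parallelism write config. A hedged
   sketch (the key is Hudi's standard `hoodie.rollback.parallelism` config; the value 200 is
   an assumption, not a recommendation):
   
   ```java
   import org.apache.spark.sql.DataFrameWriter;
   import org.apache.spark.sql.Row;
   
   public class RollbackParallelismSketch {
     // Hedged sketch: widen the fan-out of rollback work across executors.
     static DataFrameWriter<Row> tune(DataFrameWriter<Row> writer) {
       return writer.option("hoodie.rollback.parallelism", "200"); // value is an assumption
     }
   }
   ```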



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2434: [HUDI-1511] InstantGenerateOperator support multiple parallelism

2021-01-21 Thread GitBox


vinothchandar commented on pull request #2434:
URL: https://github.com/apache/hudi/pull/2434#issuecomment-764820382


   @yanghua should be good to land now if you are happy with it



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jtmzheng opened a new issue #2470: [SUPPORT] Heavy skew in ListingBasedRollbackHelper

2021-01-21 Thread GitBox


jtmzheng opened a new issue #2470:
URL: https://github.com/apache/hudi/issues/2470


   **Describe the problem you faced**
   We're seeing heavy skew in our Spark Streaming job when processing rollbacks 
(`mapToPair at ListingBasedRollbackHelper.java:100`). The example below took 
2.2h to complete with a p75 of 15 minutes and a max of 2.2h (100 tasks, long 
tail of 8 tasks that took > 1 hour)
   
   https://user-images.githubusercontent.com/3466206/105383583-4184f800-5bdf-11eb-8368-c345d452a6eb.png
   
   What can cause this skew and what can we do to alleviate/investigate this?
   
   We have Hudi configured with:
   
   ```
   hudi_options = {
   "hoodie.table.name": "transactions",
   "hoodie.datasource.write.recordkey.field": "id.value",
   "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.write.partitionpath.field": "year,month,day",
   "hoodie.datasource.write.table.name": "transactions",
   "hoodie.datasource.write.table.type": "MERGE_ON_READ",
   "hoodie.datasource.write.operation": "upsert",
   "hoodie.consistency.check.enabled": "true",
   "hoodie.datasource.write.precombine.field": "publishedAtUnixNano",
   "hoodie.compact.inline": True,
   "hoodie.compact.inline.max.delta.commits": 10,
   "hoodie.cleaner.commits.retained": 1,
   }
   ```
   
   **Environment Description**
   
   * Hudi version : 0.6.0
   
   * Spark version : 2.4.6 (EMR 5.31)
   
   * Hive version : 2.3.7
   
   * Hadoop version : Amazon 2.10.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


yanghua commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r561953785



##
File path: hudi-flink/src/test/resources/test_read_schema.avsc
##
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+{
+  "type" : "record",
+  "name" : "record",
+  "fields" : [ {

Review comment:
   Can we follow the same style for the test schema, aligned with the existing files, e.g. `HoodieCleanerPlan.avsc`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (#2412)

2021-01-21 Thread garyli
This is an automated email from the ASF dual-hosted git repository.

garyli pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 976420c  [HUDI-1512] Fix spark 2 unit tests failure with Spark 3 
(#2412)
976420c is described below

commit 976420c49a3fe764f7cecd30b6cdb32861be5537
Author: wenningd 
AuthorDate: Thu Jan 21 07:04:28 2021 -0800

[HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (#2412)

* [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

* resolve comments

Co-authored-by: Wenning Ding 
---
 hudi-spark-datasource/hudi-spark2/pom.xml | 9 -
 pom.xml   | 9 ++---
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/hudi-spark-datasource/hudi-spark2/pom.xml 
b/hudi-spark-datasource/hudi-spark2/pom.xml
index 8668984..1f6f792 100644
--- a/hudi-spark-datasource/hudi-spark2/pom.xml
+++ b/hudi-spark-datasource/hudi-spark2/pom.xml
@@ -125,6 +125,13 @@
 
   
   
+org.apache.maven.plugins
+maven-surefire-plugin
+
+  ${skip.hudi-spark2.unit.tests}
+
+  
+  
 org.apache.rat
 apache-rat-plugin
   
@@ -144,7 +151,7 @@
 
   org.scala-lang
   scala-library
-  ${scala.version}
+  ${scala11.version}
 
 
 
diff --git a/pom.xml b/pom.xml
index 6a39cef..4f8b153 100644
--- a/pom.xml
+++ b/pom.xml
@@ -105,14 +105,15 @@
 4.1.1
 0.8.0
 4.4.1
-2.4.4
+${spark2.version}
 1.12.0
 2.4.4
 3.0.0
 1.8.2
-2.11.12
-2.11
+2.11.12
 2.12.10
+${scala11.version}
+2.11
 0.12
 3.3.1
 3.0.1
@@ -140,6 +141,7 @@
 compile
 
org.apache.hudi.
 true
+false
   
 
   
@@ -1361,6 +1363,7 @@
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
 
${fasterxml.spark3.version}
+true
   
   
 



[GitHub] [hudi] garyli1019 merged pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

2021-01-21 Thread GitBox


garyli1019 merged pull request #2412:
URL: https://github.com/apache/hudi/pull/2412


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #2412: [HUDI-1512] Fix spark 2 unit tests failure with Spark 3

2021-01-21 Thread GitBox


garyli1019 commented on pull request #2412:
URL: https://github.com/apache/hudi/pull/2412#issuecomment-764705432


   should be ok to merge since we cut the release



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


yanghua commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r561943377



##
File path: hudi-flink/src/test/resources/test_source.data
##
@@ -0,0 +1,8 @@
+{"uuid": "id1", "name": "Danny", "age": 23, "ts": "1970-01-01T00:00:01", 
"partition": "par1"}

Review comment:
   Are you really so young?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2459: [MINOR] Improve code readability,remove the continue keyword

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2459:
URL: https://github.com/apache/hudi/pull/2459#issuecomment-763007666


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2459?src=pr=h1) Report
   > Merging 
[#2459](https://codecov.io/gh/apache/hudi/pull/2459?src=pr=desc) (4429a75) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/81ccb0c71ad2c17c5613698d4fb50f3b49b21fb4?el=desc)
 (81ccb0c) will **increase** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2459/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2459?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2459  +/-   ##
   
   + Coverage 50.27%   50.28%   +0.01% 
   - Complexity 3050 3051   +1 
   
 Files   419  419  
  Lines 18897    18897  
 Branches   1937 1937  
   
   + Hits   9500 9503   +3 
   + Misses 8622 8619   -3 
  Partials    775      775  
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.21% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.52% <ø> (+0.03%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `0.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `65.85% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2459?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2459/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `89.65% <0.00%> (+10.34%)` | `16.00% <0.00%> (+1.00%)` | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2459: [MINOR] Improve code readability,remove the continue keyword

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2459:
URL: https://github.com/apache/hudi/pull/2459#issuecomment-763007666







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #2463: [SUPPORT] Tuning Hudi Upsert Job

2021-01-21 Thread GitBox


rubenssoto commented on issue #2463:
URL: https://github.com/apache/hudi/issues/2463#issuecomment-764683609


   Hi @bvaradar ,
   
   Could you help me with this? :)
   
   thank you



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto closed issue #2464: [SUPPORT] ExecutorLostFailure - Try Processing 1TB Of Data

2021-01-21 Thread GitBox


rubenssoto closed issue #2464:
URL: https://github.com/apache/hudi/issues/2464


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #2464: [SUPPORT] ExecutorLostFailure - Try Processing 1TB Of Data

2021-01-21 Thread GitBox


rubenssoto commented on issue #2464:
URL: https://github.com/apache/hudi/issues/2464#issuecomment-764682521


   Thank you so much for your help @bvaradar , it worked.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2388: [HUDI-1353] add incremental timeline support for pending clustering ops

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2388:
URL: https://github.com/apache/hudi/pull/2388#issuecomment-751907604


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2388?src=pr=h1) Report
   > Merging 
[#2388](https://codecov.io/gh/apache/hudi/pull/2388?src=pr=desc) (4423c0c) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/81ccb0c71ad2c17c5613698d4fb50f3b49b21fb4?el=desc)
 (81ccb0c) will **increase** coverage by `19.15%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2388/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2388?src=pr=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#2388   +/-   ##
   =
   + Coverage 50.27%   69.43%   +19.15% 
   + Complexity 3050  357 -2693 
   =
 Files   419   53  -366 
  Lines 18897     1930   -16967 
 Branches   1937  230 -1707 
   =
   - Hits   9500 1340 -8160 
   + Misses 8622  456 -8166 
    + Partials    775      134     -641 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2388?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../java/org/apache/hudi/common/util/CommitUtils.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQ29tbWl0VXRpbHMuamF2YQ==)
 | | | |
   | 
[...rg/apache/hudi/common/bloom/SimpleBloomFilter.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL1NpbXBsZUJsb29tRmlsdGVyLmphdmE=)
 | | | |
   | 
[...apache/hudi/common/model/HoodieCommitMetadata.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUNvbW1pdE1ldGFkYXRhLmphdmE=)
 | | | |
   | 
[...pache/hudi/io/storage/HoodieFileReaderFactory.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vc3RvcmFnZS9Ib29kaWVGaWxlUmVhZGVyRmFjdG9yeS5qYXZh)
 | | | |
   | 
[...i/common/table/log/block/HoodieHFileDataBlock.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVIRmlsZURhdGFCbG9jay5qYXZh)
 | | | |
   | 
[...hudi/common/config/DFSPropertiesConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9ERlNQcm9wZXJ0aWVzQ29uZmlndXJhdGlvbi5qYXZh)
 | | | |
   | 
[.../org/apache/hudi/hadoop/utils/HoodieHiveUtils.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZUhpdmVVdGlscy5qYXZh)
 | | | |
   | 
[...e/hudi/common/model/HoodieRollingStatMetadata.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJvbGxpbmdTdGF0TWV0YWRhdGEuamF2YQ==)
 | | | |
   | 
[...pache/hudi/common/model/HoodieArchivedLogFile.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUFyY2hpdmVkTG9nRmlsZS5qYXZh)
 | | | |
   | 
[.../hive/SlashEncodedHourPartitionValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvU2xhc2hFbmNvZGVkSG91clBhcnRpdGlvblZhbHVlRXh0cmFjdG9yLmphdmE=)
 | | | |
   | ... and [346 
more](https://codecov.io/gh/apache/hudi/pull/2388/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2430: [HUDI-1522] Add a new pipeline for Flink writer

2021-01-21 Thread GitBox


yanghua commented on a change in pull request #2430:
URL: https://github.com/apache/hudi/pull/2430#discussion_r561848906



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/streamer/FlinkStreamerConfig.java
##
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.streamer;
+
+import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
+
+import com.beust.jcommander.Parameter;
+import org.apache.flink.configuration.Configuration;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Configurations for Hoodie Flink streamer.
+ */
+public class FlinkStreamerConfig extends Configuration {
+  @Parameter(names = {"--kafka-topic"}, description = "Kafka topic name.", 
required = true)
+  public String kafkaTopic;
+
+  @Parameter(names = {"--kafka-group-id"}, description = "Kafka consumer group 
id.", required = true)
+  public String kafkaGroupId;
+
+  @Parameter(names = {"--kafka-bootstrap-servers"}, description = "Kafka 
bootstrap.servers.", required = true)
+  public String kafkaBootstrapServers;
+
+  @Parameter(names = {"--flink-checkpoint-path"}, description = "Flink 
checkpoint path.")
+  public String flinkCheckPointPath;
+
+  @Parameter(names = {"--flink-block-retry-times"}, description = "Times to 
retry when latest instant has not completed.")

Review comment:
   `block` seems hard to understand. Any better word?

##
File path: 
hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamerV2.java
##
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.streamer;
+
+import org.apache.hudi.operator.FlinkOptions;
+import org.apache.hudi.operator.StreamWriteOperatorFactory;
+import org.apache.hudi.util.StreamerUtil;
+
+import com.beust.jcommander.JCommander;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
+import org.apache.flink.formats.json.JsonRowDataDeserializationSchema;
+import org.apache.flink.formats.json.TimestampFormat;
+import org.apache.flink.runtime.state.filesystem.FsStateBackend;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.runtime.typeutils.InternalTypeInfo;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.flink.table.types.logical.RowType;
+
+import java.util.Properties;
+
+/**
+ * A utility that can incrementally consume data from Kafka and apply it to the target table.
+ * Currently, it only supports COW tables and the insert and upsert operations.
+ */
+public class HoodieFlinkStreamerV2 {
+  public static void main(String[] args) throws Exception {
+StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
+
+final FlinkStreamerConfig cfg = new FlinkStreamerConfig();
+JCommander cmd = new JCommander(cfg, null, args);
+if (cfg.help || args.length == 0) {
+  cmd.usage();
+  System.exit(1);
+}
+env.enableCheckpointing(cfg.checkpointInterval);
+env.getConfig().setGlobalJobParameters(cfg);
+// We use checkpoint to 

[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-01-21 Thread GitBox


codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=h1) Report
   > Merging 
[#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=desc) (7ababea) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/a38612b10f6ae04644519270f9b5eb631a77c148?el=desc)
 (a38612b) will **decrease** coverage by `41.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree)
   
    ```diff
    @@              Coverage Diff              @@
    ##             master    #2374       +/-   ##
    ==============================================
    - Coverage     50.69%    9.68%   -41.01%
    + Complexity     3059       48     -3011
    ==============================================
      Files           419       53      -366
      Lines         18810     1930    -16880
      Branches       1924      230     -1694
    ==============================================
    - Hits           9535      187     -9348
    + Misses         8498     1730     -6768
    + Partials        777       13      -764
    ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.68% <100.00%> (-59.80%)` | `0.00 <1.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==)
 | `83.62% <100.00%> (-5.18%)` | `28.00 <1.00> (ø)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` 
