[GitHub] [hudi] lamber-ken commented on a change in pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


lamber-ken commented on a change in pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#discussion_r580040125



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
##
@@ -37,6 +38,29 @@
  */
 public class HoodieIndexUtils {
 
+  /**
+   * Fetches Pair of partition path and {@link HoodieBaseFile}s for interested 
partitions.
+   *
+   * @param partition   Partition of interest
+   * @param context Instance of {@link HoodieEngineContext} to use
+   * @param hoodieTable Instance of {@link HoodieTable} of interest
+   * @return the list of {@link HoodieBaseFile}
+   */
+  public static List<HoodieBaseFile> getLatestBaseFilesForPartition(

Review comment:
   refer to `SparkHoodieBackedTableMetadataWriter#prepRecords`





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lamber-ken commented on a change in pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


lamber-ken commented on a change in pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#discussion_r580038941



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
##
@@ -37,6 +38,29 @@
  */
 public class HoodieIndexUtils {
 
+  /**
+   * Fetches Pair of partition path and {@link HoodieBaseFile}s for interested 
partitions.
+   *
+   * @param partition   Partition of interest
+   * @param context Instance of {@link HoodieEngineContext} to use
+   * @param hoodieTable Instance of {@link HoodieTable} of interest
+   * @return the list of {@link HoodieBaseFile}
+   */
+  public static List<HoodieBaseFile> getLatestBaseFilesForPartition(

Review comment:
   It's good that `getLatestBaseFilesForPartition` was extracted from 
`getLatestBaseFilesForAllPartitions`.  
   
   Current codebase: 
   ```
  public static List<HoodieBaseFile> getLatestBaseFilesForPartition(
      final String partition,
      final HoodieTable hoodieTable) {
    Option<HoodieInstant> latestCommitTime = hoodieTable.getMetaClient().getCommitsTimeline()
        .filterCompletedInstants().lastInstant();
    if (latestCommitTime.isPresent()) {
      return hoodieTable.getBaseFileOnlyView()
          .getLatestBaseFilesBeforeOrOn(partition, latestCommitTime.get().getTimestamp())
          .collect(toList());
    }
    return Collections.emptyList();
  }
   ```
   
   Maybe the following implementation is more efficient
   ```
  public static List<HoodieBaseFile> getLatestBaseFilesForPartition(
      final String partition,
      final HoodieTable hoodieTable) {
    return hoodieTable.getFileSystemView()
        .getAllFileGroups(partition)
        .map(HoodieFileGroup::getLatestDataFile)
        .filter(Option::isPresent)
        .map(Option::get)
        .collect(toList());
  }
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#issuecomment-781103554







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


danny0405 commented on a change in pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#discussion_r580010866



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -112,6 +171,10 @@ public void processElement(I value, Context ctx, Collector<O> out) throws Exception {
 final HoodieKey hoodieKey = record.getKey();
 final BucketInfo bucketInfo;
 final HoodieRecordLocation location;
+if (!allPartitionsLoaded && !partitionLoadState.contains(hoodieKey.getPartitionPath())) {

Review comment:
   We only need to ensure that the initial partitions are loaded successfully; new input data will trigger an index update if it introduces new data partitions.
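
   A rough plain-Java sketch of that flow (my own illustration with made-up names, not the PR's actual fields): partitions that already exist on storage get a one-time index bootstrap the first time a record for them arrives, while partitions introduced by new input data only need to be marked as loaded, since the job itself writes their index entries.

   ```
   import java.util.HashSet;
   import java.util.Set;

   class LazyPartitionIndexLoader {
     // Illustrative stand-ins for the Flink state used in the PR.
     private final Set<String> loadedPartitions = new HashSet<>();
     // Would be flipped to true once every pre-existing partition has been bootstrapped.
     private boolean allPartitionsLoaded = false;

     void onRecord(String partitionPath) {
       if (!allPartitionsLoaded && !loadedPartitions.contains(partitionPath)) {
         bootstrapIndexFor(partitionPath);    // scan existing base files once
         loadedPartitions.add(partitionPath); // never scan this partition again
       }
       // ... continue with the normal bucket assignment for the record ...
     }

     private void bootstrapIndexFor(String partitionPath) {
       // Placeholder: load record keys from the partition's latest base files into state.
     }
   }
   ```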





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on pull request #2473: [HOTFIX] Revert upgrade flink verison to 1.12.0

2021-02-21 Thread GitBox


danny0405 commented on pull request #2473:
URL: https://github.com/apache/hudi/pull/2473#issuecomment-783124508


   > I have a question about this PR: is there any reason we don't want to use Flink 1.12.0? Looks like `FlinkKafkaConsumerBase` is also available in `flink-connector-kafka_2.12`.
   
   There are other conflicts with the Flink 1.12.0 release; in other words, code built against Flink 1.12.0 cannot run successfully on the Flink 1.11.0 base jars.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #2515: [HUDI-1615] [SUPPORT] ERROR HoodieTimelineArchiveLog: Failed to archive commits

2021-02-21 Thread GitBox


vinothchandar commented on issue #2515:
URL: https://github.com/apache/hudi/issues/2515#issuecomment-783085992


   Folks, got sidetracked by other work last week. Back on Hudi this week. We will get moving on this.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


danny0405 commented on a change in pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#discussion_r579977745



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -78,13 +130,14 @@ public BucketAssignFunction(Configuration conf) {
   public void open(Configuration parameters) throws Exception {
 super.open(parameters);
 HoodieWriteConfig writeConfig = 
StreamerUtil.getHoodieClientConfig(this.conf);
-HoodieFlinkEngineContext context =
-new HoodieFlinkEngineContext(
-new SerializableConfiguration(StreamerUtil.getHadoopConf()),
-new FlinkTaskContextSupplier(getRuntimeContext()));
-this.bucketAssigner = new BucketAssigner(
-context,
-writeConfig);
+this.hadoopConf = StreamerUtil.getHadoopConf();
+this.context = new HoodieFlinkEngineContext(
+new SerializableConfiguration(this.hadoopConf),
+new FlinkTaskContextSupplier(getRuntimeContext()));
+this.bucketAssigner = new BucketAssigner(context, writeConfig);
+final FileSystem fs = 
FSUtils.getFs(this.conf.getString(FlinkOptions.PATH), this.hadoopConf);

Review comment:
   Already removed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


danny0405 commented on pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#issuecomment-783069192


   > hi @danny0405 is it necessary to revert `this.bucketAssigner.reset();` to `BucketAssignFunction#snapshotState` in this patch? : )
   
   I will do it in another patch, thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#issuecomment-781103554


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2581?src=pr=h1) Report
   > Merging 
[#2581](https://codecov.io/gh/apache/hudi/pull/2581?src=pr=desc) (813fa19) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc)
 (43a0776) will **increase** coverage by `0.03%`.
   > The diff coverage is `72.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2581/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2581?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2581  +/-   ##
   
   + Coverage 51.14%   51.18%   +0.03% 
   - Complexity 3215 3226  +11 
   
 Files   438  438  
 Lines 2004120084  +43 
 Branches   2064 2068   +4 
   
   + Hits  1025010279  +29 
   - Misses 8946 8958  +12 
   - Partials845  847   +2 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `36.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.35% <ø> (-0.01%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `46.25% <72.00%> (+0.81%)` | `0.00 <10.00> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.75% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.46% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2581?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==)
 | `55.66% <ø> (-0.44%)` | `38.00 <0.00> (ø)` | |
   | 
[...udi/operator/partitioner/BucketAssignFunction.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9wYXJ0aXRpb25lci9CdWNrZXRBc3NpZ25GdW5jdGlvbi5qYXZh)
 | `79.51% <71.42%> (-12.79%)` | `18.00 <9.00> (+10.00)` | :arrow_down: |
   | 
[...ache/hudi/operator/partitioner/BucketAssigner.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9vcGVyYXRvci9wYXJ0aXRpb25lci9CdWNrZXRBc3NpZ25lci5qYXZh)
 | `80.17% <100.00%> (+0.17%)` | `19.00 <1.00> (+1.00)` | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#issuecomment-781103554







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lamber-ken commented on pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


lamber-ken commented on pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#issuecomment-783053827


   hi @danny0405 is it necessary to revert `this.bucketAssigner.reset();` to `BucketAssignFunction#snapshotState` in this patch? : )
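
   For context, a hedged sketch (illustrative names, not the actual Hudi classes) of what resetting the assigner from `snapshotState` looks like with Flink's `CheckpointedFunction`:

   ```
   import org.apache.flink.runtime.state.FunctionInitializationContext;
   import org.apache.flink.runtime.state.FunctionSnapshotContext;
   import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

   class AssignerHolder implements CheckpointedFunction {

     // Minimal stand-in for the real bucket assigner.
     static class StatefulAssigner {
       void reset() {
         // Clear cached bucket assignments accumulated during the checkpoint interval.
       }
     }

     private final StatefulAssigner bucketAssigner = new StatefulAssigner();

     @Override
     public void snapshotState(FunctionSnapshotContext context) {
       // Resetting here means each checkpoint interval starts from a clean slate.
       this.bucketAssigner.reset();
     }

     @Override
     public void initializeState(FunctionInitializationContext context) {
       // Nothing to restore in this illustrative holder.
     }
   }
   ```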



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lamber-ken commented on a change in pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


lamber-ken commented on a change in pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#discussion_r579964461



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -78,13 +130,14 @@ public BucketAssignFunction(Configuration conf) {
   public void open(Configuration parameters) throws Exception {
 super.open(parameters);
 HoodieWriteConfig writeConfig = 
StreamerUtil.getHoodieClientConfig(this.conf);
-HoodieFlinkEngineContext context =
-new HoodieFlinkEngineContext(
-new SerializableConfiguration(StreamerUtil.getHadoopConf()),
-new FlinkTaskContextSupplier(getRuntimeContext()));
-this.bucketAssigner = new BucketAssigner(
-context,
-writeConfig);
+this.hadoopConf = StreamerUtil.getHadoopConf();
+this.context = new HoodieFlinkEngineContext(
+new SerializableConfiguration(this.hadoopConf),
+new FlinkTaskContextSupplier(getRuntimeContext()));
+this.bucketAssigner = new BucketAssigner(context, writeConfig);
+final FileSystem fs = 
FSUtils.getFs(this.conf.getString(FlinkOptions.PATH), this.hadoopConf);

Review comment:
   `fs` is never used; we can remove it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] caidezhi commented on a change in pull request #2560: [HUDI-1606]align BaseJavaCommitActionExecuto#execute method with BaseSparkCommitActionExecutor

2021-02-21 Thread GitBox


caidezhi commented on a change in pull request #2560:
URL: https://github.com/apache/hudi/pull/2560#discussion_r579963408



##
File path: 
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/BaseJavaCommitActionExecutor.java
##
@@ -132,6 +132,11 @@ protected void updateIndex(List<WriteStatus> writeStatuses, HoodieWriteMetadata<List<WriteStatus>> result) {
 result.setWriteStatuses(statuses);
   }
 
+  protected void updateIndexAndCommitIfNeeded(List<WriteStatus> writeStatuses, HoodieWriteMetadata<List<WriteStatus>> result) {

Review comment:
   @leesf if it is already covered by another PR, please feel free to close this one. Thanks for your review.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #1946: [HUDI-1176]Support log4j2 config

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #1946:
URL: https://github.com/apache/hudi/pull/1946#issuecomment-774846457


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1946?src=pr=h1) Report
   > Merging 
[#1946](https://codecov.io/gh/apache/hudi/pull/1946?src=pr=desc) (2fb88d1) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc)
 (43a0776) will **decrease** coverage by `0.05%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1946/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1946?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1946  +/-   ##
   
   - Coverage 51.14%   51.09%   -0.06% 
   + Complexity 3215 3213   -2 
   
 Files   438  438  
 Lines 2004120041  
 Branches   2064 2064  
   
   - Hits  1025010239  -11 
   - Misses 8946 8959  +13 
   + Partials845  843   -2 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `36.87% <100.00%> (ø)` | `0.00 <1.00> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.37% <100.00%> (+0.01%)` | `0.00 <1.00> (ø)` | |
   | hudiflink | `45.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `68.98% <ø> (-0.78%)` | `0.00 <ø> (ø)` | |
   | hudisync | `48.61% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.46% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1946?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...rg/apache/hudi/cli/commands/CompactionCommand.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0NvbXBhY3Rpb25Db21tYW5kLmphdmE=)
 | `0.80% <ø> (ø)` | `2.00 <0.00> (ø)` | |
   | 
[...va/org/apache/hudi/cli/commands/ExportCommand.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0V4cG9ydENvbW1hbmQuamF2YQ==)
 | `1.09% <ø> (ø)` | `1.00 <0.00> (ø)` | |
   | 
[...g/apache/hudi/cli/utils/SparkTempViewProvider.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL3V0aWxzL1NwYXJrVGVtcFZpZXdQcm92aWRlci5qYXZh)
 | `59.67% <ø> (ø)` | `12.00 <0.00> (ø)` | |
   | 
[...di/common/bootstrap/index/HFileBootstrapIndex.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jvb3RzdHJhcC9pbmRleC9IRmlsZUJvb3RzdHJhcEluZGV4LmphdmE=)
 | `81.48% <ø> (ø)` | `18.00 <0.00> (ø)` | |
   | 
[...hudi/common/config/DFSPropertiesConfiguration.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9ERlNQcm9wZXJ0aWVzQ29uZmlndXJhdGlvbi5qYXZh)
 | `78.04% <ø> (ø)` | `14.00 <0.00> (ø)` | |
   | 
[...c/main/java/org/apache/hudi/common/fs/FSUtils.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0ZTVXRpbHMuamF2YQ==)
 | `49.31% <ø> (ø)` | `61.00 <0.00> (ø)` | |
   | 
[...pache/hudi/common/fs/FailSafeConsistencyGuard.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0ZhaWxTYWZlQ29uc2lzdGVuY3lHdWFyZC5qYXZh)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...che/hudi/common/fs/OptimisticConsistencyGuard.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL09wdGltaXN0aWNDb25zaXN0ZW5jeUd1YXJkLmphdmE=)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...apache/hudi/common/model/HoodieCommitMetadata.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUNvbW1pdE1ldGFkYXRhLmphdmE=)
 | `42.92% <ø> (ø)` | `43.00 <0.00> (ø)` | |
   | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#issuecomment-781103554


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2581?src=pr=h1) Report
   > Merging 
[#2581](https://codecov.io/gh/apache/hudi/pull/2581?src=pr=desc) (813fa19) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc)
 (43a0776) will **decrease** coverage by `41.45%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2581/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2581?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master   #2581   +/-   ##
   
   - Coverage 51.14%   9.69%   -41.46% 
   + Complexity 3215  48 -3167 
   
 Files   438  53  -385 
 Lines 200411929-18112 
 Branches   2064 230 -1834 
   
   - Hits  10250 187-10063 
   + Misses 89461729 -7217 
   + Partials845  13  -832 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.69% <ø> (-59.78%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2581?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2581/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | `0.00% <0.00%> 

[GitHub] [hudi] codecov-io edited a comment on pull request #1946: [HUDI-1176]Support log4j2 config

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #1946:
URL: https://github.com/apache/hudi/pull/1946#issuecomment-774846457







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #1946: [HUDI-1176]Support log4j2 config

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #1946:
URL: https://github.com/apache/hudi/pull/1946#issuecomment-774846457


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1946?src=pr=h1) Report
   > Merging 
[#1946](https://codecov.io/gh/apache/hudi/pull/1946?src=pr=desc) (2fb88d1) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/43a0776c7c88a5f7beac6c8853db7e341810635a?el=desc)
 (43a0776) will **increase** coverage by `18.32%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1946/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1946?src=pr=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#1946   +/-   ##
   =
   + Coverage 51.14%   69.46%   +18.32% 
   + Complexity 3215  356 -2859 
   =
 Files   438   53  -385 
 Lines 20041 1929-18112 
 Branches   2064  230 -1834 
   =
   - Hits  10250 1340 -8910 
   + Misses 8946  456 -8490 
   + Partials845  133  -712 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.46% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1946?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...org/apache/hudi/utilities/HoodieClusteringJob.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZUNsdXN0ZXJpbmdKb2IuamF2YQ==)
 | `65.21% <ø> (ø)` | `9.00 <0.00> (ø)` | |
   | 
[.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==)
 | `88.79% <ø> (ø)` | `28.00 <0.00> (ø)` | |
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | `65.69% <ø> (ø)` | `32.00 <0.00> (ø)` | |
   | 
[...callback/kafka/HoodieWriteCommitKafkaCallback.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NhbGxiYWNrL2thZmthL0hvb2RpZVdyaXRlQ29tbWl0S2Fma2FDYWxsYmFjay5qYXZh)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...udi/utilities/deltastreamer/BootstrapExecutor.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvQm9vdHN0cmFwRXhlY3V0b3IuamF2YQ==)
 | `79.54% <ø> (ø)` | `6.00 <0.00> (ø)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.00% <ø> (ø)` | `50.00 <0.00> (ø)` | |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `68.72% <ø> (ø)` | `18.00 <0.00> (ø)` | |
   | 
[...s/deltastreamer/HoodieMultiTableDeltaStreamer.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllTXVsdGlUYWJsZURlbHRhU3RyZWFtZXIuamF2YQ==)
 | `78.39% <ø> (ø)` | `18.00 <0.00> (ø)` | |
   | 
[...tilities/deltastreamer/SchedulerConfGenerator.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvU2NoZWR1bGVyQ29uZkdlbmVyYXRvci5qYXZh)
 | `90.90% <ø> (ø)` | `8.00 <0.00> (ø)` | |
   | 
[...apache/hudi/utilities/sources/AvroKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/1946/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb0thZmthU291cmNlLmphdmE=)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | ... and 

[jira] [Updated] (HUDI-1607) Decimal handling bug in SparkAvroPostProcessor

2021-02-21 Thread Jingwei Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingwei Zhang updated HUDI-1607:

Description: 
This issue is related to [HUDI-1343|https://github.com/apache/hudi/pull/2192].

I think the purpose of HUDI-1343 was to bridge the difference between Avro 1.8.2 (used by Hudi) and Avro 1.9.2 (used by the upstream system) through an internal Struct type, in particular the incompatible forms the two versions use to express a nullable type.

It was all good until I hit the Decimal type. Since it can be either FIXED or BYTES, if an Avro schema contains a decimal type with BYTES as its literal type, then after this two-way conversion its literal type becomes FIXED instead. This causes an exception to be thrown in AvroConversionHelper, as the data underneath is a HeapByteBuffer rather than a GenericFixed.

  was:
This issue related to [#[Hudi-1343]|[https://github.com/apache/hudi/pull/2192].]

I think the purpose of Hudi-1343 was to bridge the different between avro 
1.8.2(used by hudi) and avro 1.9.2(used by upstream system) thru internal 
Struct type. In particular, the incompatible form to express nullable type 
between those two versions. 

It was all good until I hit the type Decimal. Since it can either be FIXED or 
BYTES, if an avro schema contains decimal type with BYTES as its literal type, 
after this two way conversion its literal type become FIXED instead. This will 
cause an exception to be thrown in AvroConversionHelper as the data underneath 
is ByteBuffer rather than GenericFixed.


> Decimal handling bug in SparkAvroPostProcessor 
> ---
>
> Key: HUDI-1607
> URL: https://issues.apache.org/jira/browse/HUDI-1607
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jingwei Zhang
>Priority: Major
>
> This issue is related to [HUDI-1343|https://github.com/apache/hudi/pull/2192].
> I think the purpose of HUDI-1343 was to bridge the difference between Avro 1.8.2 
> (used by Hudi) and Avro 1.9.2 (used by the upstream system) through an internal 
> Struct type, in particular the incompatible forms the two versions use to express 
> a nullable type.
> It was all good until I hit the Decimal type. Since it can be either FIXED or 
> BYTES, if an Avro schema contains a decimal type with BYTES as its literal 
> type, then after this two-way conversion its literal type becomes FIXED instead. 
> This causes an exception to be thrown in AvroConversionHelper, as the data 
> underneath is a HeapByteBuffer rather than a GenericFixed.
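
A small illustrative snippet (my addition, not from the ticket) of the two schema shapes involved: the conversion described above effectively turns the first into the second, while the runtime value stays a ByteBuffer, which is the mismatch AvroConversionHelper trips over.

```
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DecimalShapes {
  public static void main(String[] args) {
    // Decimal backed by BYTES: the runtime value is a ByteBuffer.
    Schema bytesDecimal = LogicalTypes.decimal(10, 2)
        .addToSchema(Schema.create(Schema.Type.BYTES));
    // Decimal backed by FIXED: the runtime value is a GenericFixed.
    Schema fixedDecimal = LogicalTypes.decimal(10, 2)
        .addToSchema(Schema.createFixed("dec", null, "example", 5));
    System.out.println(bytesDecimal);
    System.out.println(fixedDecimal);
  }
}
```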



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 commented on a change in pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


danny0405 commented on a change in pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#discussion_r579947920



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
##
@@ -37,6 +38,29 @@
  */
 public class HoodieIndexUtils {
 
+  /**
+   * Fetches Pair of partition path and {@link HoodieBaseFile}s for interested 
partitions.
+   *
+   * @param partition   Partition of interest
+   * @param context Instance of {@link HoodieEngineContext} to use
+   * @param hoodieTable Instance of {@link HoodieTable} of interest
+   * @return the list of {@link HoodieBaseFile}
+   */
+  public static List<HoodieBaseFile> getLatestBaseFilesForPartition(

Review comment:
   I think we could.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lhjzmn commented on a change in pull request #2532: [HUDI-1534]HiveSyncTool-It is not necessary to use JDBC and MetaStoreClient at the same time

2021-02-21 Thread GitBox


lhjzmn commented on a change in pull request #2532:
URL: https://github.com/apache/hudi/pull/2532#discussion_r579925386



##
File path: 
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
##
@@ -343,116 +304,48 @@ public boolean doesTableExist(String tableName) {
*/
   public boolean doesDataBaseExist(String databaseName) {
 try {
-  Database database = client.getDatabase(databaseName);
-  if (database != null && databaseName.equals(database.getName())) {
-return true;
-  }
+  List<String> databases = client.getAllDatabases();
+  return databases.contains(databaseName.toLowerCase());
 } catch (TException e) {
+  LOG.error("Failed to check if database exists " + databaseName, e);
   throw new HoodieHiveSyncException("Failed to check if database exists " 
+ databaseName, e);
 }
-return false;
   }
 
-  /**
-   * Execute a update in hive metastore with this SQL.
-   *
-   * @param s SQL to execute
-   */
-  public void updateHiveSQL(String s) {
-if (syncConfig.useJdbc) {
-  Statement stmt = null;
-  try {
-stmt = connection.createStatement();
-LOG.info("Executing SQL " + s);
-stmt.execute(s);
-  } catch (SQLException e) {
-throw new HoodieHiveSyncException("Failed in executing SQL " + s, e);
-  } finally {
-closeQuietly(null, stmt);
-  }
-} else {
-  updateHiveSQLUsingHiveDriver(s);
+  public void createDataBase(String databaseName, String location, String 
description) {

Review comment:
   Yes, this is a problem. I have changed the spelling used by Hive synchronization from `DataBase` to `Database` (the second b is lowercase) and resubmitted the code.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


danny0405 commented on a change in pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#discussion_r579915210



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -146,5 +209,69 @@ public void notifyCheckpointComplete(long l) {
 // Refresh the table state when there are new commits.
 this.bucketAssigner.reset();
 this.bucketAssigner.refreshTable();
+checkPartitionsLoaded();
+  }
+
+  /**
+   * Load all the indices of the given partition path into the backup state.
+   *
+   * @param partitionPath The partition path
+   * @throws Exception when error occurs for state update
+   */
+  private void loadRecords(String partitionPath) throws Exception {
+    HoodieTable hoodieTable = bucketAssigner.getTable();
+    List<HoodieBaseFile> latestBaseFiles =
+        HoodieIndexUtils.getLatestBaseFilesForPartition(partitionPath, context, hoodieTable);
+    for (HoodieBaseFile baseFile : latestBaseFiles) {
+      List<HoodieKey> hoodieKeys =
+          ParquetUtils.fetchRecordKeyPartitionPathFromParquet(hadoopConf, new Path(baseFile.getPath()));
+      hoodieKeys.forEach(hoodieKey -> {
+        try {
+          this.indexState.put(hoodieKey, new HoodieRecordLocation(baseFile.getCommitTime(), baseFile.getFileId()));
+        } catch (Exception e) {
+          throw new HoodieIOException("Error when load record keys from file: " + baseFile);
+        }
+      });
+    }
+    // Mark the partition path as loaded.
+    partitionLoadState.put(partitionPath, 0);

Review comment:
   > `The 0 is meaningless here`
   
   It is meaningless anyway, because Flink does not provide a Set state, so a MapState with a dummy value is used instead.
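
   For reference, a minimal hedged sketch (illustrative names, not the PR's code) of that workaround: a Flink `MapState` whose value carries no meaning, used purely for membership checks.

   ```
   import org.apache.flink.api.common.functions.RichFlatMapFunction;
   import org.apache.flink.api.common.state.MapState;
   import org.apache.flink.api.common.state.MapStateDescriptor;
   import org.apache.flink.configuration.Configuration;
   import org.apache.flink.util.Collector;

   // Must be applied on a keyed stream, since MapState is keyed state.
   public class PartitionSetExample extends RichFlatMapFunction<String, String> {

     private transient MapState<String, Integer> loadedPartitions;

     @Override
     public void open(Configuration parameters) {
       loadedPartitions = getRuntimeContext().getMapState(
           new MapStateDescriptor<>("loadedPartitions", String.class, Integer.class));
     }

     @Override
     public void flatMap(String partitionPath, Collector<String> out) throws Exception {
       if (!loadedPartitions.contains(partitionPath)) {
         loadedPartitions.put(partitionPath, 0); // the value is meaningless; only membership matters
         out.collect(partitionPath);             // emit the partition the first time it is seen
       }
     }
   }
   ```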





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] lhjzmn commented on a change in pull request #2532: [HUDI-1534]HiveSyncTool-It is not necessary to use JDBC and MetaStoreClient at the same time

2021-02-21 Thread GitBox


lhjzmn commented on a change in pull request #2532:
URL: https://github.com/apache/hudi/pull/2532#discussion_r579912713



##
File path: 
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
##
@@ -64,32 +56,23 @@
 public class HoodieHiveClient extends AbstractSyncHoodieClient {
 
   private static final String HOODIE_LAST_COMMIT_TIME_SYNC = 
"last_commit_time_sync";
-  private static final String HIVE_ESCAPE_CHARACTER = 
HiveSchemaUtil.HIVE_ESCAPE_CHARACTER;
 
   private static final Logger LOG = 
LogManager.getLogger(HoodieHiveClient.class);
   private final PartitionValueExtractor partitionValueExtractor;
   private IMetaStoreClient client;
   private HiveSyncConfig syncConfig;
   private FileSystem fs;
-  private Connection connection;
   private HoodieTimeline activeTimeline;
-  private HiveConf configuration;
 
   public HoodieHiveClient(HiveSyncConfig cfg, HiveConf configuration, 
FileSystem fs) {
 super(cfg.basePath, cfg.assumeDatePartitioning, 
cfg.useFileListingFromMetadata, cfg.verifyMetadataFileListing, fs);
 this.syncConfig = cfg;
 this.fs = fs;
 
-this.configuration = configuration;
-// Support both JDBC and metastore based implementations for backwards 
compatiblity. Future users should
-// disable jdbc and depend on metastore client for all hive registrations
-if (cfg.useJdbc) {

Review comment:
   After this change, Hive synchronization no longer uses any JDBC-related code or configuration, so that content has been removed from HoodieHiveClient.java. Feel free to continue the discussion, @satishkotha.
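
   For illustration only (my own sketch with made-up names, not the PR's code), the metastore-client-only flow for the database check and creation discussed above could look roughly like this:

   ```
   import java.util.Collections;
   import java.util.List;
   import org.apache.hadoop.hive.metastore.IMetaStoreClient;
   import org.apache.hadoop.hive.metastore.api.Database;
   import org.apache.thrift.TException;

   class MetastoreOnlySync {

     private final IMetaStoreClient client;

     MetastoreOnlySync(IMetaStoreClient client) {
       this.client = client;
     }

     boolean databaseExists(String databaseName) throws TException {
       // The metastore reports database names in lower case.
       List<String> databases = client.getAllDatabases();
       return databases.contains(databaseName.toLowerCase());
     }

     void createDatabaseIfMissing(String databaseName, String location, String description) throws TException {
       if (!databaseExists(databaseName)) {
         client.createDatabase(new Database(databaseName, description, location, Collections.emptyMap()));
       }
     }
   }
   ```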





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 commented on pull request #2443: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable

2021-02-21 Thread GitBox


liujinhui1994 commented on pull request #2443:
URL: https://github.com/apache/hudi/pull/2443#issuecomment-782990952


   @nsivabalan  Thanks for the suggestion, I will make changes



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-02-21 Thread GitBox


teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579910252



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
 val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
 log.info("Obtained hudi table path: " + tablePath)
 
+val sparkEngineContext = new 
HoodieSparkEngineContext(sqlContext.sparkContext)
+val fsBackedTableMetadata =
+  new FileSystemBackedTableMetadata(sparkEngineContext, new 
SerializableConfiguration(fs.getConf), tablePath, false)
+val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+val onePartitionPath = if(!partitionPaths.isEmpty && 
!StringUtils.isEmpty(partitionPaths.get(0))) {
+tablePath + "/" + partitionPaths.get(0)
+  } else {
+tablePath
+  }
+val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+log.info("Obtained hudi data path: " + dataPath)
+parameters += "path" -> dataPath

Review comment:
   I will see whether I can keep the automatic inference while still meeting your needs.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1633) Make callback return HoodieWriteStat

2021-02-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1633:
-
Labels: pull-request-available  (was: )

> Make callback return HoodieWriteStat
> 
>
> Key: HUDI-1633
> URL: https://issues.apache.org/jira/browse/HUDI-1633
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2445: [HUDI-1633] Make callback return HoodieWriteStat

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2445:
URL: https://github.com/apache/hudi/pull/2445#issuecomment-782598913


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2445?src=pr=h1) Report
   > Merging 
[#2445](https://codecov.io/gh/apache/hudi/pull/2445?src=pr=desc) (0ec3320) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/e3d3677b7e7899705b624925666317f0c074f7c7?el=desc)
 (e3d3677) will **increase** coverage by `18.73%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2445/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2445?src=pr=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#2445   +/-   ##
   =
   + Coverage 50.73%   69.46%   +18.73% 
   + Complexity 3064  356 -2708 
   =
 Files   419   53  -366 
 Lines 18797 1929-16868 
 Branches   1922  230 -1692 
   =
   - Hits   9536 1340 -8196 
   + Misses 8487  456 -8031 
   + Partials774  133  -641 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.46% <0.00%> (-0.02%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2445?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...callback/kafka/HoodieWriteCommitKafkaCallback.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NhbGxiYWNrL2thZmthL0hvb2RpZVdyaXRlQ29tbWl0S2Fma2FDYWxsYmFjay5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.00% <0.00%> (-0.87%)` | `50.00% <0.00%> (-1.00%)` | |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `68.72% <0.00%> (-0.26%)` | `18.00% <0.00%> (ø%)` | |
   | 
[...oning/compaction/CompactionV2MigrationHandler.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvY29tcGFjdGlvbi9Db21wYWN0aW9uVjJNaWdyYXRpb25IYW5kbGVyLmphdmE=)
 | | | |
   | 
[...tSpotMemoryLayoutSpecification64bitCompressed.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvanZtL0hvdFNwb3RNZW1vcnlMYXlvdXRTcGVjaWZpY2F0aW9uNjRiaXRDb21wcmVzc2VkLmphdmE=)
 | | | |
   | 
[...i/common/model/OverwriteWithLatestAvroPayload.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL092ZXJ3cml0ZVdpdGhMYXRlc3RBdnJvUGF5bG9hZC5qYXZh)
 | | | |
   | 
[.../org/apache/hudi/cli/commands/TempViewCommand.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1RlbXBWaWV3Q29tbWFuZC5qYXZh)
 | | | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | | | |
   | 
[...pache/hudi/cli/commands/FileSystemViewCommand.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0ZpbGVTeXN0ZW1WaWV3Q29tbWFuZC5qYXZh)
 | | | |
   | 
[...e/hudi/common/table/timeline/dto/FileGroupDTO.java](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL2R0by9GaWxlR3JvdXBEVE8uamF2YQ==)
 | | | |
   | ... and [360 
more](https://codecov.io/gh/apache/hudi/pull/2445/diff?src=pr=tree-more) | |
   



This is an automated 

[GitHub] [hudi] teeyog commented on a change in pull request #2475: [HUDI-1527] automatically infer the data directory, users only need to specify the table directory

2021-02-21 Thread GitBox


teeyog commented on a change in pull request #2475:
URL: https://github.com/apache/hudi/pull/2475#discussion_r579907953



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -74,6 +78,19 @@ class DefaultSource extends RelationProvider
 val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
 log.info("Obtained hudi table path: " + tablePath)
 
+val sparkEngineContext = new 
HoodieSparkEngineContext(sqlContext.sparkContext)
+val fsBackedTableMetadata =
+  new FileSystemBackedTableMetadata(sparkEngineContext, new 
SerializableConfiguration(fs.getConf), tablePath, false)
+val partitionPaths = fsBackedTableMetadata.getAllPartitionPaths
+val onePartitionPath = if(!partitionPaths.isEmpty && 
!StringUtils.isEmpty(partitionPaths.get(0))) {
+tablePath + "/" + partitionPaths.get(0)
+  } else {
+tablePath
+  }
+val dataPath = DataSourceUtils.getDataPath(tablePath, onePartitionPath)
+log.info("Obtained hudi data path: " + dataPath)
+parameters += "path" -> dataPath

Review comment:
   The path specified by the user would be overwritten by the automatically inferred data directory, so your requirement would not be met.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] liujinhui1994 commented on pull request #2443: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable

2021-02-21 Thread GitBox


liujinhui1994 commented on pull request #2443:
URL: https://github.com/apache/hudi/pull/2443#issuecomment-782985138


   > Just to confirm, we don't need to fix anything wrt HoodieHiveClient wrt 
this patch is it?
   
   yes



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan edited a comment on pull request #2208: [HUDI-1040] Make Hudi support Spark 3

2021-02-21 Thread GitBox


nsivabalan edited a comment on pull request #2208:
URL: https://github.com/apache/hudi/pull/2208#issuecomment-782984396


   Commenting here so that all interested folks will get notified. We are hitting an issue related to Spark 2.4.4 vs Spark 3 while trying out the quick start in Hudi for a MOR table with the spark_2.12 bundle. More details are [here](https://issues.apache.org/jira/browse/HUDI-1568). We run into NoSuchMethodError: 'void org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>' during read of the MOR table.
   I saw a similar error was already quoted [here](https://github.com/apache/hudi/pull/1760#issuecomment-713993418) while the Spark 3 work was ongoing, so I thought someone could help out.
   Can someone help me out as to what I am missing while trying out the quick start?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2208: [HUDI-1040] Make Hudi support Spark 3

2021-02-21 Thread GitBox


nsivabalan commented on pull request #2208:
URL: https://github.com/apache/hudi/pull/2208#issuecomment-782984396


   Commenting here so that all interested folks will get notified. We are hitting an issue related to Spark 2.4.4 vs Spark 3 while trying out the quick start in Hudi for a MOR table with the spark_2.12 bundle. More details are [here](https://issues.apache.org/jira/browse/HUDI-1568). We run into NoSuchMethodError: 'void org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>' during read of the MOR table.
   
   Can someone tell me whether I am missing something while trying out the quick start?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1633) Make callback return HoodieWriteStat

2021-02-21 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui reassigned HUDI-1633:
---

Assignee: liujinhui

> Make callback return HoodieWriteStat
> 
>
> Key: HUDI-1633
> URL: https://issues.apache.org/jira/browse/HUDI-1633
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1633) Make callback return HoodieWriteStat

2021-02-21 Thread liujinhui (Jira)
liujinhui created HUDI-1633:
---

 Summary: Make callback return HoodieWriteStat
 Key: HUDI-1633
 URL: https://issues.apache.org/jira/browse/HUDI-1633
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: liujinhui






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-1568) Issues w/ spark_bundle_2.12 : NoSuchMethodError: 'void org.apache.spark.sql.execution.datasources.InMemoryFileIndex.

2021-02-21 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288120#comment-17288120
 ] 

sivabalan narayanan edited comment on HUDI-1568 at 2/22/21, 1:44 AM:
-

 

Reported by customer as well : https://github.com/apache/hudi/issues/2566 


was (Author: shivnarayan):
Tried pyspark as per the quick start utils for a MERGE_ON_READ table. COW has no issues. Running into NoSuchMethodError for

org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>

```

>>> df.write.format("hudi"). \

  options(**hudi_options). \

  mode("overwrite"). \

  save(basePath)

21/02/21 20:14:40 WARN HoodieSparkSqlWriter$: hoodie table at 
[file:/tmp/hudi_trips_cow|file:///tmp/hudi_trips_cow] already exists. Deleting 
existing data & overwriting with new data.

>>>                                                                             

>>> 

>>> tripsSnapshotDF = spark. \

  read. \

  format("hudi"). \

  load(basePath + "/*/*/*/*")

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/readwriter.py",
 line 178, in load

    return self._df(self._jreader.load(path))

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/utils.py", 
line 128, in deco

    return f(*a, **kw)

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.

: java.lang.NoSuchMethodError: 'void org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(org.apache.spark.sql.SparkSession, scala.collection.Seq, scala.collection.immutable.Map, scala.Option, org.apache.spark.sql.execution.datasources.FileStatusCache)'

at 
org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)

at 
org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)

at 
org.apache.hudi.MergeOnReadSnapshotRelation.(MergeOnReadSnapshotRelation.scala:72)

at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)

at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)

at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)

at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)

at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)

at scala.Option.getOrElse(Option.scala:189)

at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)

at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)

at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)

at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.base/java.lang.reflect.Method.invoke(Method.java:566)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:282)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:238)

at java.base/java.lang.Thread.run(Thread.java:834)

```

Tried the quick start utils for pyspark as is. 

 

These are the paths I had set 

```

alias python='python3'

alias python=/usr/local/bin/python3.9

alias pip=/usr/local/bin/pip3

export 
JAVA_HOME=/Library/java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/

export 
JRE_HOME=/Library/java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/jre/

export SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1/libexec/

export PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:$PATH

```

 

 

 

> Issues w/ spark_bundle_2.12 : NoSuchMethodError: 'void 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.
> --
>
> Key: HUDI-1568
> URL: https://issues.apache.org/jira/browse/HUDI-1568
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> I tried Quick Start with hudi-spark-bundle_2.12 and it fails w/
> NoSuchMethodError: 'void 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init> during 
> read.
> This is happening only for MOR and not for COW. 
> 

[jira] [Updated] (HUDI-1568) Issues w/ spark_bundle_2.12 : NoSuchMethodError: 'void org.apache.spark.sql.execution.datasources.InMemoryFileIndex.

2021-02-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1568:
--
Labels: sev:critical user-support-issues  (was: )

> Issues w/ spark_bundle_2.12 : NoSuchMethodError: 'void 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.
> --
>
> Key: HUDI-1568
> URL: https://issues.apache.org/jira/browse/HUDI-1568
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> I tried Quick Start with hudi-spark-bundle_2.12 and it fails w/
> NoSuchMethodError: 'void 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init> during 
> read.
> This is happening only for MOR and not for COW. 
> Command used to invoke spark shell
>   
> spark-3.0.1-bin-hadoop2.7/bin/spark-shell \
>  --packages 
> org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1
>  \
>  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>   my local spark is spark-3.0.1.
> {code:java}
> . // usual steps as given in quick start utils.
> .
> .
> scala> df.write.format("hudi").
>      |   options(getQuickstartWriteConfigs).
>      |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>      |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>      |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>      |   option(TABLE_NAME, tableName).
>  |   option("hoodie.datasource.write.table.type","MERGE_ON_READ")
>      |   mode(Overwrite).
>      |   save(basePath)
> val tripsSnapshotDF = spark.
>   read.
>   format("hudi").
>   load(basePath + "/*/*/*/*")
> java.lang.NoSuchMethodError: 'void 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(org.apache.spark.sql.SparkSession,
>  scala.collection.Seq, scala.collection.immutable.Map, scala.Option, 
> org.apache.spark.sql.execution.datasources.FileStatusCache)'
>   at 
> org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
>   at 
> org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
>   at 
> org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
>   ... 62 elided
> {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan edited a comment on issue #2566: [SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

2021-02-21 Thread GitBox


nsivabalan edited a comment on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-782977872


   I am still triaging the issue. here is the tracking ticket in the mean time: 
https://issues.apache.org/jira/browse/HUDI-1568. Happening even for spark. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1568) Issues w/ spark_bundle_2.12 : NoSuchMethodError: 'void org.apache.spark.sql.execution.datasources.InMemoryFileIndex.

2021-02-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1568:
--
Description: 
I tried Quick Start with hudi-spark-bundle_2.12 and it fails w/

NoSuchMethodError: 'void 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init> during read.

This is happening only for MOR and not for COW. 

Command used to invoke spark shell
  

spark-3.0.1-bin-hadoop2.7/bin/spark-shell \
 --packages 
org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1
 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
  my local spark is spark-3.0.1.
{code:java}
. // usual steps as given in quick start utils.
.
.
scala> df.write.format("hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, tableName).
 |   option("hoodie.datasource.write.table.type","MERGE_ON_READ")
     |   mode(Overwrite).
     |   save(basePath)

val tripsSnapshotDF = spark.
  read.
  format("hudi").
  load(basePath + "/*/*/*/*")

java.lang.NoSuchMethodError: 'void 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(org.apache.spark.sql.SparkSession,
 scala.collection.Seq, scala.collection.immutable.Map, scala.Option, 
org.apache.spark.sql.execution.datasources.FileStatusCache)'
  at 
org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
  at 
org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
  at 
org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
  at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 62 elided

{code}
 

 

 

  was:
I tried Quick Start with hudi-spark-bundle_2.12 and it fails w/ 
ClassNotFoundError for

org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.

hudi-spark_bundle_2.11 works fine. 

 

Command used to invoke spark shell
  spark-shell \
 --packages 
org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1
 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
  my local spark is spark-3.0.1.
{code:java}
scala> df.write.format("hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, tableName).
     |   mode(Overwrite).
     |   save(basePath)
java.lang.NoClassDefFoundError: 
org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at 
scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
  at 

[GitHub] [hudi] nsivabalan edited a comment on issue #2566: [SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

2021-02-21 Thread GitBox


nsivabalan edited a comment on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-782977872


   I am still triaging the issue. here is the tracking ticket in the mean time: 
https://issues.apache.org/jira/browse/HUDI-1568. Happening even for quick start 
utils in spark. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1568) Issues w/ spark_bundle_2.12 : NoSuchMethodError: 'void org.apache.spark.sql.execution.datasources.InMemoryFileIndex.

2021-02-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1568:
--
Summary: Issues w/ spark_bundle_2.12 : NoSuchMethodError: 'void 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>  (was: 
Issues w/ spark_bundle_2.12 : ClassNotFoundError for 
org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2)

> Issues w/ spark_bundle_2.12 : NoSuchMethodError: 'void 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.
> --
>
> Key: HUDI-1568
> URL: https://issues.apache.org/jira/browse/HUDI-1568
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>
> I tried Quick Start with hudi-spark-bundle_2.12 and it fails w/ 
> ClassNotFoundError for
> org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.
> hudi-spark_bundle_2.11 works fine. 
>  
> Command used to invoke spark shell
>   spark-shell \
>  --packages 
> org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1
>  \
>  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>   my local spark is spark-3.0.1.
> {code:java}
> scala> df.write.format("hudi").
>      |   options(getQuickstartWriteConfigs).
>      |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>      |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>      |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>      |   option(TABLE_NAME, tableName).
>      |   mode(Overwrite).
>      |   save(basePath)
> java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
>   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>   at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
>   at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:245)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
>   ... 68 elided
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 96 more
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan edited a comment on issue #2566: [SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

2021-02-21 Thread GitBox


nsivabalan edited a comment on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-782977872


   I am still triaging the issue. here is the tracking ticket in the mean time: 
https://issues.apache.org/jira/browse/HUDI-1568. Check the comment section.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2566: [SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

2021-02-21 Thread GitBox


nsivabalan commented on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-782977872


   I am still triaging the issue. here is the tracking ticket: 
https://issues.apache.org/jira/browse/HUDI-1568



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-1568) Issues w/ spark_bundle_2.12 : ClassNotFoundError for org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2

2021-02-21 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288120#comment-17288120
 ] 

sivabalan narayanan edited comment on HUDI-1568 at 2/22/21, 1:20 AM:
-

Tried pyspark as per quick start utils for MERGE_ON_READ table. Running into 
NoSuchMethodError for

org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>

```

>>> df.write.format("hudi"). \

  options(**hudi_options). \

  mode("overwrite"). \

  save(basePath)

21/02/21 20:14:40 WARN HoodieSparkSqlWriter$: hoodie table at 
[file:/tmp/hudi_trips_cow|file:///tmp/hudi_trips_cow] already exists. Deleting 
existing data & overwriting with new data.

>>>                                                                             

>>> 

>>> tripsSnapshotDF = spark. \

  read. \

  format("hudi"). \

  load(basePath + "/*/*/*/*")

Traceback (most recent call last):

  File "", line 1, in 

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/readwriter.py",
 line 178, in load

    return self._df(self._jreader.load(path))

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/utils.py", 
line 128, in deco

    return f(*a, **kw)

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.

: java.lang.NoSuchMethodError: 'void 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(org.apache.spark.sql.SparkSession,
 scala.collection.Seq, scala.collection.immutable.Map, scala.Option, 
org.apache.spark.sql.execution.datasources.FileStatusCache)'

at 
org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)

at 
org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)

at 
org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)

at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)

at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)

at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)

at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)

at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)

at scala.Option.getOrElse(Option.scala:189)

at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)

at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)

at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)

at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.base/java.lang.reflect.Method.invoke(Method.java:566)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:282)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:238)

at java.base/java.lang.Thread.run(Thread.java:834)

```

Tried the quick start utils for pyspark as is. 

 

These are the paths I had set 

```

alias python='python3'

alias python=/usr/local/bin/python3.9

alias pip=/usr/local/bin/pip3

export 
JAVA_HOME=/Library/java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/

export 
JRE_HOME=/Library/java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/jre/

export SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1/libexec/

export PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:$PATH

```

 

 

 


was (Author: shivnarayan):
Tried pyspark as per quick start utils. Running into NoSuchMethodError for

org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>

```

>>> df.write.format("hudi"). \

  options(**hudi_options). \

  mode("overwrite"). \

  save(basePath)

21/02/21 20:14:40 WARN HoodieSparkSqlWriter$: hoodie table at 
file:/tmp/hudi_trips_cow already exists. Deleting existing data & overwriting 
with new data.

>>>                                                                             

>>> 

>>> tripsSnapshotDF = spark. \

  read. \

  format("hudi"). \

  load(basePath + "/*/*/*/*")

Traceback (most recent call last):

  File "", line 1, in 

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/readwriter.py",
 line 178, in load

    return self._df(self._jreader.load(path))

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 

[jira] [Comment Edited] (HUDI-1568) Issues w/ spark_bundle_2.12 : ClassNotFoundError for org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2

2021-02-21 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288120#comment-17288120
 ] 

sivabalan narayanan edited comment on HUDI-1568 at 2/22/21, 1:20 AM:
-

Tried pyspark as per quick start utils for MERGE_ON_READ table. COW has no 
issues. Running into NoSuchMethodError for

org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>

```

>>> df.write.format("hudi"). \

  options(**hudi_options). \

  mode("overwrite"). \

  save(basePath)

21/02/21 20:14:40 WARN HoodieSparkSqlWriter$: hoodie table at 
[file:/tmp/hudi_trips_cow|file:///tmp/hudi_trips_cow] already exists. Deleting 
existing data & overwriting with new data.

>>>                                                                             

>>> 

>>> tripsSnapshotDF = spark. \

  read. \

  format("hudi"). \

  load(basePath + "/*/*/*/*")

Traceback (most recent call last):

  File "", line 1, in 

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/readwriter.py",
 line 178, in load

    return self._df(self._jreader.load(path))

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/utils.py", 
line 128, in deco

    return f(*a, **kw)

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.

: java.lang.NoSuchMethodError: 'void 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(org.apache.spark.sql.SparkSession,
 scala.collection.Seq, scala.collection.immutable.Map, scala.Option, 
org.apache.spark.sql.execution.datasources.FileStatusCache)'

at 
org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)

at 
org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)

at 
org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)

at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)

at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)

at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)

at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)

at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)

at scala.Option.getOrElse(Option.scala:189)

at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)

at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)

at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)

at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.base/java.lang.reflect.Method.invoke(Method.java:566)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:282)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:238)

at java.base/java.lang.Thread.run(Thread.java:834)

```

Tried the quick start utils for pyspark as is. 

 

These are the paths I had set 

```

alias python='python3'

alias python=/usr/local/bin/python3.9

alias pip=/usr/local/bin/pip3

export 
JAVA_HOME=/Library/java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/

export 
JRE_HOME=/Library/java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/jre/

export SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1/libexec/

export PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:$PATH

```

 

 

 


was (Author: shivnarayan):
Tried pyspark as per quick start utils for MERGE_ON_READ table. Running into 
NoSuchMethodError for

org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>

```

>>> df.write.format("hudi"). \

  options(**hudi_options). \

  mode("overwrite"). \

  save(basePath)

21/02/21 20:14:40 WARN HoodieSparkSqlWriter$: hoodie table at 
[file:/tmp/hudi_trips_cow|file:///tmp/hudi_trips_cow] already exists. Deleting 
existing data & overwriting with new data.

>>>                                                                             

>>> 

>>> tripsSnapshotDF = spark. \

  read. \

  format("hudi"). \

  load(basePath + "/*/*/*/*")

Traceback (most recent call last):

  File "", line 1, in 

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/readwriter.py",
 line 178, in load

    return self._df(self._jreader.load(path))

  File 

[jira] [Commented] (HUDI-1568) Issues w/ spark_bundle_2.12 : ClassNotFoundError for org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2

2021-02-21 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288120#comment-17288120
 ] 

sivabalan narayanan commented on HUDI-1568:
---

Tried pyspark as per quick start utils. Running into NoSuchMethodError for

org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>

```

>>> df.write.format("hudi"). \

  options(**hudi_options). \

  mode("overwrite"). \

  save(basePath)

21/02/21 20:14:40 WARN HoodieSparkSqlWriter$: hoodie table at 
file:/tmp/hudi_trips_cow already exists. Deleting existing data & overwriting 
with new data.

>>>                                                                             

>>> 

>>> tripsSnapshotDF = spark. \

  read. \

  format("hudi"). \

  load(basePath + "/*/*/*/*")

Traceback (most recent call last):

  File "", line 1, in 

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/readwriter.py",
 line 178, in load

    return self._df(self._jreader.load(path))

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/sql/utils.py", 
line 128, in deco

    return f(*a, **kw)

  File 
"/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o64.load.

: java.lang.NoSuchMethodError: 'void 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(org.apache.spark.sql.SparkSession,
 scala.collection.Seq, scala.collection.immutable.Map, scala.Option, 
org.apache.spark.sql.execution.datasources.FileStatusCache)'

 at 
org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)

 at 
org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)

 at 
org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:72)

 at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)

 at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)

 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)

 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)

 at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)

 at scala.Option.getOrElse(Option.scala:189)

 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)

 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)

 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)

 at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.base/java.lang.reflect.Method.invoke(Method.java:566)

 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)

 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

 at py4j.Gateway.invoke(Gateway.java:282)

 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

 at py4j.commands.CallCommand.execute(CallCommand.java:79)

 at py4j.GatewayConnection.run(GatewayConnection.java:238)

 at java.base/java.lang.Thread.run(Thread.java:834)

```

Tried the quick start utils for pyspark as is. 

 

These are the paths I had set 

```

alias python='python3'

alias python=/usr/local/bin/python3.9

alias pip=/usr/local/bin/pip3

export 
JAVA_HOME=/Library/java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/

export 
JRE_HOME=/Library/java/JavaVirtualMachines/jdk1.8.0_192.jdk/Contents/Home/jre/

export SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1/libexec/

export PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:$PATH

```

 

 

 

> Issues w/ spark_bundle_2.12 : ClassNotFoundError for 
> org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2
> ---
>
> Key: HUDI-1568
> URL: https://issues.apache.org/jira/browse/HUDI-1568
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>
> I tried Quick Start with hudi-spark-bundle_2.12 and it fails w/ 
> ClassNotFoundError for
> org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.
> hudi-spark_bundle_2.11 works fine. 
>  
> Command used to invoke spark shell
>   spark-shell \
>  --packages 
> org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1
>  \
>  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
>   my local spark is spark-3.0.1.
> {code:java}
> scala> 

[GitHub] [hudi] Rap70r edited a comment on issue #2586: [SUPPORT] - How to guarantee snapshot isolation when reading Hudi tables in S3?

2021-02-21 Thread GitBox


Rap70r edited a comment on issue #2586:
URL: https://github.com/apache/hudi/issues/2586#issuecomment-782892789


   Hello nsivabalan, thank you for getting back to me.
   
   * All Hudi tables are stored in S3 buckets. We use Spark Structured 
Streaming to apply incremental updates against S3 Hudi datasets.
   
   * **Stacktrace**
   
   ```
   org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to 
download file path: 
s3://bucket_name/folder_name/table_name/some_partition/some_parquet_file.parquet,
 range: 0-515243, partition values: [empty row], isDataPresent: false
at 
org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:252)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:132)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.FileNotFoundException: No such file or directory 
's3://bucket_name/folder_name/table_name/some_partition/some_parquet_file.parquet'
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:473)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
at 
org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:449)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildPrefetcherWithPartitionValues$1(ParquetFileFormat.scala:492)
at 
org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
at 
org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:73)
at 
org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   ```
   
   * Writer Configs:
   
   ```
   Output Path: S3 path
   hoodie.datasource.write.operation: upsert
   parallelism: 3000
   hoodie.datasource.write.table.type: COPY_ON_WRITE
   hoodie.cleaner.policy: KEEP_LATEST_FILE_VERSIONS
   File Version Retained: 1
   hoodie.datasource.hive_sync.enable: false
   SaveMode: Append
   partitionBy: Single Column
   ```
   
   * Reader Configs:
   
   ```
   val df = ss.read
.format("org.apache.hudi")
.option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
.load("s3://path/to/hudi/table/*/*")
 
   df.createOrReplaceTempView("hudi_table")
   ```
   
   * At any point in time, this setup has just a single writer. However, the 
writer applies incremental upserts very frequently, and consecutive upsert 
jobs do not overlap. This means that while a reader is running a time-consuming 
query against a Hudi dataset Spark DataFrame, the writer has time to finish 
multiple upserts.
   
   **To Reproduce**
   
   * Load the dataframe using Hudi:
   ```
   val df = ss.read
.format("org.apache.hudi")
.option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
.load("s3://path/to/hudi/table/*/*")
 
   df.createOrReplaceTempView("hudi_table")
   ```
   
   * Apply time consuming Spark SQL queries against 'hudi_table'
   
   * A different Spark process updates Hudi dataset incrementally.
   
   * After upsert is done, if the time consuming query is still 

[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=h1) Report
   > Merging 
[#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=desc) (15e2502) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/ffcfb58bacab377bc72d20041baa54a3fd8fc812?el=desc)
 (ffcfb58) will **decrease** coverage by `0.38%`.
   > The diff coverage is `1.28%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2374  +/-   ##
   
   - Coverage 51.15%   50.76%   -0.39% 
   + Complexity 3215 3204  -11 
   
 Files   438  447   +9 
 Lines 2004320191 +148 
 Branches   2065 2075  +10 
   
   - Hits  1025210249   -3 
   - Misses 8946 9097 +151 
 Partials845  845  
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `36.87% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.99% <1.28%> (-0.39%)` | `0.00 <0.00> (ø)` | |
   | hudiflink | `45.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.16% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.75% <50.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | hudisync | `45.37% <0.00%> (-3.25%)` | `0.00 <0.00> (ø)` | |
   | huditimelineservice | `66.49% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.36% <0.00%> (-0.11%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...g/apache/hudi/common/config/LockConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Mb2NrQ29uZmlndXJhdGlvbi5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...java/org/apache/hudi/common/lock/LockProvider.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2xvY2svTG9ja1Byb3ZpZGVyLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...in/java/org/apache/hudi/common/lock/LockState.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2xvY2svTG9ja1N0YXRlLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...apache/hudi/common/model/HoodieCommitMetadata.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUNvbW1pdE1ldGFkYXRhLmphdmE=)
 | `43.49% <ø> (+0.57%)` | `32.00 <0.00> (-11.00)` | :arrow_up: |
   | 
[...apache/hudi/common/model/HoodieCommonMetadata.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUNvbW1vbk1ldGFkYXRhLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...ava/org/apache/hudi/common/model/TableService.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1RhYmxlU2VydmljZS5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...apache/hudi/common/model/WriteConcurrencyMode.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1dyaXRlQ29uY3VycmVuY3lNb2RlLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...i/common/table/timeline/HoodieDefaultTimeline.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZURlZmF1bHRUaW1lbGluZS5qYXZh)
 | `82.66% <0.00%> (-2.27%)` | `59.00 <0.00> (ø)` | |
   | 
[...che/hudi/common/table/timeline/HoodieTimeline.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZVRpbWVsaW5lLmphdmE=)
 | `91.30% <ø> (ø)` | `44.00 <0.00> (ø)` | |
   | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=h1) Report
   > Merging 
[#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=desc) (22b7a6a) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/ffcfb58bacab377bc72d20041baa54a3fd8fc812?el=desc)
 (ffcfb58) will **increase** coverage by `18.31%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#2374   +/-   ##
   =
   + Coverage 51.15%   69.46%   +18.31% 
   + Complexity 3215  356 -2859 
   =
 Files   438   53  -385 
 Lines 20043 1929-18114 
 Branches   2065  230 -1835 
   =
   - Hits  10252 1340 -8912 
   + Misses 8946  456 -8490 
   + Partials845  133  -712 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.46% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.00% <0.00%> (ø)` | `50.00 <0.00> (ø)` | |
   | 
[...tSpotMemoryLayoutSpecification64bitCompressed.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvanZtL0hvdFNwb3RNZW1vcnlMYXlvdXRTcGVjaWZpY2F0aW9uNjRiaXRDb21wcmVzc2VkLmphdmE=)
 | | | |
   | 
[.../hudi/common/config/SerializableConfiguration.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9TZXJpYWxpemFibGVDb25maWd1cmF0aW9uLmphdmE=)
 | | | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | | | |
   | 
[.../spark/sql/hudi/streaming/HoodieSourceOffset.scala](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9zcGFyay9zcWwvaHVkaS9zdHJlYW1pbmcvSG9vZGllU291cmNlT2Zmc2V0LnNjYWxh)
 | | | |
   | 
[.../java/org/apache/hudi/common/util/HoodieTimer.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvSG9vZGllVGltZXIuamF2YQ==)
 | | | |
   | 
[...a/org/apache/hudi/common/util/collection/Pair.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvY29sbGVjdGlvbi9QYWlyLmphdmE=)
 | | | |
   | 
[...java/org/apache/hudi/common/engine/EngineType.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9FbmdpbmVUeXBlLmphdmE=)
 | | | |
   | 
[...ava/org/apache/hudi/cli/commands/TableCommand.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1RhYmxlQ29tbWFuZC5qYXZh)
 | | | |
   | 
[...ain/scala/org/apache/hudi/cli/DedupeSparkJob.scala](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS1jbGkvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL2NsaS9EZWR1cGVTcGFya0pvYi5zY2FsYQ==)
 | | | |
   | ... and [370 
more](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree-more) | |
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-02-21 Thread GitBox


codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=h1) Report
   > Merging 
[#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=desc) (22b7a6a) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/ffcfb58bacab377bc72d20041baa54a3fd8fc812?el=desc)
 (ffcfb58) will **decrease** coverage by `41.45%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master   #2374   +/-   ##
   
   - Coverage 51.15%   9.69%   -41.46% 
   + Complexity 3215  48 -3167 
   
 Files   438  53  -385 
 Lines 200431929-18114 
 Branches   2065 230 -1835 
   
   - Hits  10252 187-10065 
   + Misses 89461729 -7217 
   + Partials845  13  -832 
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.69% <0.00%> (-59.78%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `0.00% <0.00%> (-70.00%)` | `0.00 <0.00> (-50.00)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | 

[GitHub] [hudi] rubenssoto opened a new issue #2588: [SUPPORT] Cannot create hive connection

2021-02-21 Thread GitBox


rubenssoto opened a new issue #2588:
URL: https://github.com/apache/hudi/issues/2588


   Hello People,
   
   Emr 6.1
   Spark 3.0.0
   Hudi 0.7.0
   
   My EMR is configured to use the Glue catalog as a metastore. Most of the time my 
Hudi jobs work great, but sometimes I have problems with the Hive connection:
   ```
   21/02/21 09:30:10 ERROR Client: Application diagnostics message: User class 
threw exception: java.lang.Exception: Error on Table: mobile_checkout, Error 
Message: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive 
connection jdbc:hive2://ip-10-0-49-168.us-west-2.compute.internal:1/
at jobs.TableProcessor.start(TableProcessor.scala:95)
at 
TableProcessorWrapper$.$anonfun$main$2(TableProcessorWrapper.scala:23)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at 
java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
   
   Exception in thread "main" org.apache.spark.SparkException: Application 
application_1613496813774_2019 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1191)
at 
org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1582)
at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:936)
at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1015)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1024)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   21/02/21 09:30:10 INFO ShutdownHookManager: Shutdown hook called
   ```
   
   I opened a ticket with AWS; it has been open for more than 2 weeks and has not 
been solved yet.
   I know this may not be a Hudi problem, but do you have any idea how to solve 
this?
   Is there another way in Hudi to sync with Hive?
   I have been using Hive with Spark for some months and I have never seen an error 
like this.
   
   
   Thank you so much!!!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2566: [SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

2021-02-21 Thread GitBox


nsivabalan commented on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-782900702


   got it. yes, I was able to repro it now. will investigate further. thnx.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jtmzheng commented on issue #2566: [SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

2021-02-21 Thread GitBox


jtmzheng commented on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-782897494


   @nsivabalan ""hoodie.datasource.write.keygenerator.class" needs to be 
"org.apache.hudi.keygen.ComplexKeyGenerator" (I was able to reproduce the 
stacktrace you provided when I set it to SimpleKeyGenerator, I'm guessing 
because partition path is a composite key in my example). Thanks for 
investigating!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan edited a comment on pull request #2445: [MINOR] Callback message add partitionPath Field

2021-02-21 Thread GitBox


nsivabalan edited a comment on pull request #2445:
URL: https://github.com/apache/hudi/pull/2445#issuecomment-782896085


   yeah, I missed seeing the MINOR. 
   but I did check the code path to ensure we don't trigger any dag actions, as 
this is List\.
   realizing that this is called before postWrite, within commitstats(). So, 
wanted to hear your opinion on whether this is fine or whether we should disallow 
sending the entire WriteStats to the callback. 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2445: [MINOR] Callback message add partitionPath Field

2021-02-21 Thread GitBox


nsivabalan commented on pull request #2445:
URL: https://github.com/apache/hudi/pull/2445#issuecomment-782896085


   Yeah, I missed seeing the MINOR.
   But I did check the code path to ensure we don't trigger any DAG actions, as this is a List.
   Realizing that this is called before postWrite, within commitStats(), I wanted to hear your opinion: is this fine, or should we disallow sending the entire WriteStats to the callback?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Rap70r commented on issue #2586: [SUPPORT] - How to guarantee snapshot isolation when reading Hudi tables in S3?

2021-02-21 Thread GitBox


Rap70r commented on issue #2586:
URL: https://github.com/apache/hudi/issues/2586#issuecomment-782892789


   Hello nsivabalan, thank you for getting back to me.
   
   * All Hudi tables are stored in S3 buckets. We use Spark Structured 
Streaming to apply incremental updates against S3 Hudi datasets.
   
   * **Stacktrace**
   
   ```
   org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://bucket_name/folder_name/table_name/some_partition/some_parquet_file.parquet, range: 0-515243, partition values: [empty row], isDataPresent: false
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:252)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:132)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:179)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
   Caused by: java.io.FileNotFoundException: No such file or directory 's3://bucket_name/folder_name/table_name/some_partition/some_parquet_file.parquet'
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:473)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:694)
at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:449)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildPrefetcherWithPartitionValues$1(ParquetFileFormat.scala:492)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:73)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anon$1.call(AsyncFileDownloader.scala:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   ```
   
   * Writer Configs:
   
   ```
   Output Path: S3 path
   hoodie.datasource.write.operation: upsert
   parallelism: 3000
   hoodie.datasource.write.table.type: COPY_ON_WRITE
   hoodie.cleaner.policy: KEEP_LATEST_FILE_VERSIONS
   File Version Retained: 1
   hoodie.datasource.hive_sync.enable: false
   SaveMode: Append
   partitionBy: Single Column
   ```
   
   * Reader Configs:
   
   ```
   val df = ss.read
 .format("org.apache.hudi")
 .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
 .load("s3://path/to/hudi/table/*/*")
 
   df.createOrReplaceTempView("hudi_table")
   ```
   
   * At any point of time, this setup has just a single writer.
   
   **To Reproduce**
   
   * Load the dataframe using Hudi:
   ```
   val df = ss.read
 .format("org.apache.hudi")
 .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
 .load("s3://path/to/hudi/table/*/*")
 
   df.createOrReplaceTempView("hudi_table")
   ```
   
   * Apply time consuming Spark SQL queries against 'hudi_table'
   
   * A different Spark process updates Hudi dataset incrementally.
   
   * After the upsert is done, if the time-consuming query is still running, it will crash with the error below:
   
   ```
   org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://bucket_name/folder_name/table_name/some_partition/some_parquet_file.parquet, range: 0-515243, partition values: [empty 

[GitHub] [hudi] nsivabalan commented on issue #2566: [SUPPORT] Unable to read Hudi MOR data set in a test on 0.7

2021-02-21 Thread GitBox


nsivabalan commented on issue #2566:
URL: https://github.com/apache/hudi/issues/2566#issuecomment-782891622


   @jtmzheng : I haven't tried py.test before. I am trying to repro and running into issues w.r.t. a missing parquet schema. Can you tell me if I am missing something? Maybe I am making some rookie mistake; can you help me out?
   
   ```
   docker build -f hudi.Dockerfile -t test_hudi .
   docker run test_hudi py.test -s --verbose test_hudi.py
   ```
   
   File contents:
   https://gist.github.com/nsivabalan/1b768c324691456605473b556d53c9a8
   
   stack trace: 
https://gist.github.com/nsivabalan/74476a48493fac2a9b8fdea4cd973ca9



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2445: [MINOR] Callback message add partitionPath Field

2021-02-21 Thread GitBox


nsivabalan commented on pull request #2445:
URL: https://github.com/apache/hudi/pull/2445#issuecomment-782890593


   @yanghua: feel free to land this once your feedback is addressed



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2445: [MINOR] Callback message add partitionPath Field

2021-02-21 Thread GitBox


nsivabalan commented on a change in pull request #2445:
URL: https://github.com/apache/hudi/pull/2445#discussion_r579835900



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/callback/common/HoodieWriteCommitCallbackMessage.java
##
@@ -73,4 +82,12 @@ public String getBasePath() {
   public void setBasePath(String basePath) {
 this.basePath = basePath;
   }
+
+  public List<HoodieWriteStat> getHoodieWriteStat() {
+return hoodieWriteStat;
+  }
+
+  public void setHoodieWriteStat(List<HoodieWriteStat> hoodieWriteStat) {

Review comment:
   Can you help me understand why we need this setter?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hotienvu commented on pull request #2210: [HUDI-1348] Provide option to clean up DFS sources

2021-02-21 Thread GitBox


hotienvu commented on pull request #2210:
URL: https://github.com/apache/hudi/pull/2210#issuecomment-782864578


   > @hotienvu : If you are busy, I can look at addressing any pending feedback 
and looking into the build failure. Will push updates to the patch w/ any fixes 
if you are ok.
   
   Hi @nsivabalan: Sorry for the late reply. Yes, your help is much appreciated since I'm not quite familiar with the integration test suite.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #2382: [HUDI-1477] Support CopyOnWriteTable in java client

2021-02-21 Thread GitBox


leesf commented on pull request #2382:
URL: https://github.com/apache/hudi/pull/2382#issuecomment-782860435


   @nsivabalan would you also want to take a pass here?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2560: [HUDI-1606]align BaseJavaCommitActionExecuto#execute method with BaseSparkCommitActionExecutor

2021-02-21 Thread GitBox


leesf commented on a change in pull request #2560:
URL: https://github.com/apache/hudi/pull/2560#discussion_r579808883



##
File path: 
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/BaseJavaCommitActionExecutor.java
##
@@ -132,6 +132,11 @@ protected void updateIndex(List<WriteStatus> writeStatuses, HoodieWriteMetadata<List<WriteStatus>> result) {
 result.setWriteStatuses(statuses);
   }
 
+  protected void updateIndexAndCommitIfNeeded(List<WriteStatus> writeStatuses, HoodieWriteMetadata<List<WriteStatus>> result) {

Review comment:
   @caidezhi Sorry for the delay, there is a PR 
https://github.com/apache/hudi/pull/2382 to support COW in Java client which 
includes the changes here.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2557: [SUPPORT]Container exited with a non-zero exit code 137

2021-02-21 Thread GitBox


nsivabalan commented on issue #2557:
URL: https://github.com/apache/hudi/issues/2557#issuecomment-782851569


   To help investigate better:
   - Can you post the configs you used to write to Hudi?
   - Can you post a screenshot of the Spark stages, so that we know where it is failing and can relate it to the configs used?
   - Can you give a rough idea of your dataset's record keys? Are they completely random or do they have some ordering, and what are they made of?
   - I assume you are using regular bloom as the index type.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1606) HoodieJavaWriteClientExample fail with exception

2021-02-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1606:
--
Labels: pull-request-available sev:high user-support-issues  (was: 
pull-request-available)

> HoodieJavaWriteClientExample fail with exception
> 
>
> Key: HUDI-1606
> URL: https://issues.apache.org/jira/browse/HUDI-1606
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Dezhi Cai
>Priority: Major
>  Labels: pull-request-available, sev:high, user-support-issues
> Fix For: 0.8.0
>
> Attachments: log.txt
>
>
> Running org.apache.hudi.examples.java.HoodieJavaWriteClientExample locally 
> fails with an exception.
> system: Mac
> tested hudi branch: master,  0.7.0
>  program args: [file:///tmp/hoodie/sample-table] hoodie_rt
> root cause:
> BaseJavaCommitActionExecutor#execute method does not commit the index when it 
> is needed. 
> solution: 
> align BaseJavaCommitActionExecutor#execute method with 
> BaseSparkCommitActionExecutor, commit when it is needed
>  
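
A minimal sketch of the described alignment (assuming a helper shaped like the Spark executor's; the actual change lives in the linked pull request and may differ):

```
  // Sketch: after writing, update the index and, when auto-commit is enabled, also commit,
  // mirroring BaseSparkCommitActionExecutor.
  protected void updateIndexAndCommitIfNeeded(List<WriteStatus> writeStatuses,
                                              HoodieWriteMetadata<List<WriteStatus>> result) {
    updateIndex(writeStatuses, result);
    commitOnAutoCommit(result);
  }
```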



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan closed issue #2558: [SUPPORT] HoodieJavaWriteClientExample fail with exception

2021-02-21 Thread GitBox


nsivabalan closed issue #2558:
URL: https://github.com/apache/hudi/issues/2558


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2558: [SUPPORT] HoodieJavaWriteClientExample fail with exception

2021-02-21 Thread GitBox


nsivabalan commented on issue #2558:
URL: https://github.com/apache/hudi/issues/2558#issuecomment-782851083


   Thanks, closing this out as we have a tracking JIRA and a PR as well. Thanks for your contribution to help make Hudi better :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2573: Rebuild a HUDI table using the Snapshot of HUDI table with its commit timeline metadata

2021-02-21 Thread GitBox


nsivabalan commented on issue #2573:
URL: https://github.com/apache/hudi/issues/2573#issuecomment-782850853


   @bvaradar : can you respond here when you get a chance. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan edited a comment on issue #2586: [SUPPORT] - How to guarantee snapshot isolation when reading Hudi tables in S3?

2021-02-21 Thread GitBox


nsivabalan edited a comment on issue #2586:
URL: https://github.com/apache/hudi/issues/2586#issuecomment-782850685


   Hudi follows MVCC and hence there is isolation between writers and readers. You should not see any such issues.
   - "if any process updates the table under S3": by this do you mean updating the Hudi dataset via the Spark datasource / DeltaStreamer etc., or by some other means?
   - Can you post the stack trace you see? Without any logs it is going to be tough to debug this.
   - Can you post the configs you use to write to and read from Hudi?
   - I assume you have just one writer at any point in time. Can you please confirm that?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2586: [SUPPORT] - How to guarantee snapshot isolation when reading Hudi tables in S3?

2021-02-21 Thread GitBox


nsivabalan commented on issue #2586:
URL: https://github.com/apache/hudi/issues/2586#issuecomment-782850685


   Hudi follows MVCC and hence there is isolation between writers and readers. You should not see any such issues.
   - "if any process updates the table under S3": by this do you mean updating the Hudi dataset via the Spark datasource / DeltaStreamer etc., or by some other means?
   - Can you post the stack trace you see? Without any logs it is going to be tough to debug this.
   - Can you post the configs you use to write to and read from Hudi?
   - I assume you have just one writer at any point in time. Can you please confirm that?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2210: [HUDI-1348] Provide option to clean up DFS sources

2021-02-21 Thread GitBox


nsivabalan commented on a change in pull request #2210:
URL: https://github.com/apache/hudi/pull/2210#discussion_r579798666



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java
##
@@ -1159,6 +1163,72 @@ public void 
testCsvDFSSourceNoHeaderWithSchemaProviderAndTransformer() throws Ex
 testCsvDFSSource(false, '\t', true, 
Collections.singletonList(TripsWithDistanceTransformer.class.getName()));
   }
 
+  @Test
+  public void testDFSSourceDeleteFilesAfterCommit() throws Exception {
+TypedProperties props = new TypedProperties();
+props.setProperty("hoodie.deltastreamer.source.dfs.clean.mode", "delete");
+testDFSSourceCleanUp(props);
+  }
+
+  @Test
+  public void testDFSSourceArchiveFilesAfterCommit() throws Exception {
+TypedProperties props = new TypedProperties();
+props.setProperty("hoodie.deltastreamer.source.dfs.clean.mode", "archive");

Review comment:
   Can you please add a test for no-op as well, just to ensure that with this patch it is in fact a no-op? :) 

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java
##
@@ -217,7 +217,7 @@ public static void initClass() throws Exception {
 
invalidHiveSyncProps.setProperty("hoodie.deltastreamer.ingestion.uber_db.dummy_table_uber.configFile",
 dfsBasePath + "/config/invalid_hive_sync_uber_config.properties");
 UtilitiesTestBase.Helpers.savePropsToDFS(invalidHiveSyncProps, dfs, 
dfsBasePath + "/" + PROPS_INVALID_HIVE_SYNC_TEST_SOURCE1);
 
-prepareParquetDFSFiles(PARQUET_NUM_RECORDS);
+prepareParquetDFSFiles(PARQUET_SOURCE_ROOT + "/1.parquet", 
PARQUET_NUM_RECORDS);

Review comment:
   You can make just the sourceRoot an argument; 1.parquet can be left abstracted within the method.
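   
   A small sketch of the suggested shape (the record-writing helper below is hypothetical):
   
   ```
   // Callers choose only the directory; "1.parquet" stays an internal detail of the helper.
   private static void prepareParquetDFSFiles(String sourceRoot, int numRecords) throws IOException {
     String targetPath = sourceRoot + "/1.parquet";
     writeTestRecordsAsParquet(targetPath, numRecords); // hypothetical helper doing the actual generation
   }
   
   // call sites then read:
   // prepareParquetDFSFiles(PARQUET_SOURCE_ROOT, PARQUET_NUM_RECORDS);
   ```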

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/sources/AbstractDFSSourceTestBase.java
##
@@ -189,4 +205,63 @@ public void testReadingFromSource() throws IOException {
 Option.empty(), Long.MAX_VALUE);
 assertEquals(10101, fetch6.getBatch().get().count());
   }
+
+  @Test
+  public void testCleanUpSourceAfterCommit() throws IOException {

Review comment:
   Not sure why we have tests in this class. The classes in testutils are utility classes to assist in testing and should not contain any individual tests as such. Can you move the tests from this class to a separate test class? If we don't hold the line on this, it will slowly invite everyone to add more tests to this class.

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/sources/AbstractDFSSourceTestBase.java
##
@@ -79,14 +85,24 @@ public void setup() throws Exception {
   @AfterEach
   public void teardown() throws Exception {
 super.teardown();
+dfs.delete(new Path(dfsRoot), true);
+dfs.delete(new Path(dfsArchivePath), true);
   }
 
   /**
* Prepares the specific {@link Source} to test, by passing in necessary 
configurations.
*
* @return A {@link Source} using DFS as the file system.
+   * @param defaults
*/
-  protected abstract Source prepareDFSSource();
+  protected abstract Source prepareDFSSource(TypedProperties defaults);
+
+  /**
+   * Prepares the specific {@link Source} to test.
+   */
+  protected Source prepareDFSSource() {
+return prepareDFSSource(new TypedProperties());

Review comment:
   if we follow the proposed changes in tests, we don't need to make any 
changes to these tests.

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestHoodieDeltaStreamer.java
##
@@ -1159,6 +1163,72 @@ public void 
testCsvDFSSourceNoHeaderWithSchemaProviderAndTransformer() throws Ex
 testCsvDFSSource(false, '\t', true, 
Collections.singletonList(TripsWithDistanceTransformer.class.getName()));
   }
 
+  @Test
+  public void testDFSSourceDeleteFilesAfterCommit() throws Exception {
+TypedProperties props = new TypedProperties();
+props.setProperty("hoodie.deltastreamer.source.dfs.clean.mode", "delete");
+testDFSSourceCleanUp(props);
+  }
+
+  @Test
+  public void testDFSSourceArchiveFilesAfterCommit() throws Exception {
+TypedProperties props = new TypedProperties();
+props.setProperty("hoodie.deltastreamer.source.dfs.clean.mode", "archive");
+final String archivePath = dfsBasePath + "/archive";
+dfs.mkdirs(new Path(archivePath));
+props.setProperty("hoodie.deltastreamer.source.dfs.clean.archiveDir", 
archivePath);
+assertEquals(0, Helpers.listAllFiles(archivePath).size());
+testDFSSourceCleanUp(props);
+// archive dir should contain source files
+assertEquals(1, Helpers.listAllFiles(archivePath).size());
+  }
+
+  private void testDFSSourceCleanUp(TypedProperties props) throws Exception {
+// since each source cleanup test will modify source, we need to set 
different dfs.root each test
+

[GitHub] [hudi] nsivabalan removed a comment on pull request #2443: [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable

2021-02-21 Thread GitBox


nsivabalan removed a comment on pull request #2443:
URL: https://github.com/apache/hudi/pull/2443#issuecomment-782802631


   @vinothchandar @yanghua : A general doubt on supporting this config knob. As quoted above, if the connection fails and we set the config of interest to true, then IMetaStoreClient will not be initialized within HoodieHiveClient. So, for all methods in this class we have to handle null.
   
   For e.g. createTable, getTableSchema, etc., I am not sure how to go about this.
   One approach we can take:
   for some operations we can do our best to return empty or false, or something of that sort.
   For e.g., for getPartitionEvents and doesTableExist we can return an empty list / false;
   addPartitionsToTable we can just leave as a no-op;
   but for other operations like getTableSchema() we can throw exceptions.
   What are your thoughts?
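   
   For illustration only, a standalone sketch of that split; none of these types are the real HoodieHiveClient, and the method names simply mirror the ones discussed above:
   
   ```
   import java.util.Collections;
   import java.util.List;
   
   // `MetaStore` stands in for IMetaStoreClient; it stays null when the connection
   // failed and the proposed "ignore hive connection failure" config is enabled.
   class LenientHiveClientSketch {
     interface MetaStore {
       boolean tableExists(String table);
       List<String> getSchema(String table);
       List<String> partitionEvents(String table);
     }
   
     private final MetaStore metaStore; // may be null
   
     LenientHiveClientSketch(MetaStore metaStore) {
       this.metaStore = metaStore;
     }
   
     boolean doesTableExist(String table) {
       // best effort: without a connection, report the table as missing
       return metaStore != null && metaStore.tableExists(table);
     }
   
     List<String> getPartitionEvents(String table) {
       // best effort: without a connection, there is nothing to report
       return metaStore == null ? Collections.emptyList() : metaStore.partitionEvents(table);
     }
   
     void addPartitionsToTable(String table, List<String> partitions) {
       if (metaStore == null) {
         return; // no-op without a connection
       }
       // a real implementation would register the partitions here
     }
   
     List<String> getTableSchema(String table) {
       if (metaStore == null) {
         // callers cannot proceed without a schema, so fail loudly
         throw new IllegalStateException("Hive metastore connection is not available");
       }
       return metaStore.getSchema(table);
     }
   }
   ```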



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #2581: [HUDI-1624] The state based index should bootstrap from existing base…

2021-02-21 Thread GitBox


garyli1019 commented on a change in pull request #2581:
URL: https://github.com/apache/hudi/pull/2581#discussion_r579779589



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -146,5 +209,69 @@ public void notifyCheckpointComplete(long l) {
 // Refresh the table state when there are new commits.
 this.bucketAssigner.reset();
 this.bucketAssigner.refreshTable();
+checkPartitionsLoaded();
+  }
+
+  /**
+   * Load all the indices of give partition path into the backup state.
+   *
+   * @param partitionPath The partition path
+   * @throws Exception when error occurs for state update
+   */
+  private void loadRecords(String partitionPath) throws Exception {
+HoodieTable hoodieTable = bucketAssigner.getTable();
+List<HoodieBaseFile> latestBaseFiles =
+HoodieIndexUtils.getLatestBaseFilesForPartition(partitionPath, context, hoodieTable);
+for (HoodieBaseFile baseFile : latestBaseFiles) {
+  List<HoodieKey> hoodieKeys =
+  ParquetUtils.fetchRecordKeyPartitionPathFromParquet(hadoopConf, new Path(baseFile.getPath()));
+  hoodieKeys.forEach(hoodieKey -> {
+try {
+  this.indexState.put(hoodieKey, new HoodieRecordLocation(baseFile.getCommitTime(), baseFile.getFileId()));
+} catch (Exception e) {
+  throw new HoodieIOException("Error when load record keys from file: " + baseFile);
+}
+  });
+}
+// Mark the partition path as loaded.
+partitionLoadState.put(partitionPath, 0);

Review comment:
   maybe put a boolean?
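   
   i.e., assuming the state were declared as a `MapState<String, Boolean>` (the current patch stores an integer), the marker line would read:
   
   ```
   // mark the partition path as fully loaded
   partitionLoadState.put(partitionPath, true);
   ```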

##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -146,5 +209,69 @@ public void notifyCheckpointComplete(long l) {
 // Refresh the table state when there are new commits.
 this.bucketAssigner.reset();
 this.bucketAssigner.refreshTable();
+checkPartitionsLoaded();
+  }
+
+  /**
+   * Load all the indices of give partition path into the backup state.
+   *
+   * @param partitionPath The partition path
+   * @throws Exception when error occurs for state update
+   */
+  private void loadRecords(String partitionPath) throws Exception {
+HoodieTable hoodieTable = bucketAssigner.getTable();
+List<HoodieBaseFile> latestBaseFiles =
+HoodieIndexUtils.getLatestBaseFilesForPartition(partitionPath, context, hoodieTable);
+for (HoodieBaseFile baseFile : latestBaseFiles) {
+  List<HoodieKey> hoodieKeys =
+  ParquetUtils.fetchRecordKeyPartitionPathFromParquet(hadoopConf, new Path(baseFile.getPath()));
+  hoodieKeys.forEach(hoodieKey -> {
+try {
+  this.indexState.put(hoodieKey, new HoodieRecordLocation(baseFile.getCommitTime(), baseFile.getFileId()));
+} catch (Exception e) {
+  throw new HoodieIOException("Error when load record keys from file: " + baseFile);
+}
+  });
+}
+// Mark the partition path as loaded.
+partitionLoadState.put(partitionPath, 0);
+  }
+
+  /**
+   * Checks whether all the partitions of the table are loaded into the state,
+   * set the flag {@code allPartitionsLoaded} to true if it is.
+   */
+  private void checkPartitionsLoaded() {
+for (String partition : this.allPartitionPath) {
+  try {
+if (!this.partitionLoadState.contains(partition)) {
+  return;
+}
+  } catch (Exception e) {
+LOG.warn("Error when check whether all partitions are loaded, 
ignored", e);
+throw new HoodieException(e);
+  }
+}
+this.allPartitionsLoaded = true;
+  }

Review comment:
   IIUC, this seems necessary because we don't update the `partitionLoadState` if we see a new partition in the upcoming records, so we need to check after each commit. Otherwise, we would need to update the `partitionLoadState` together with the `indexState`.

##
File path: 
hudi-flink/src/main/java/org/apache/hudi/operator/partitioner/BucketAssignFunction.java
##
@@ -78,13 +130,14 @@ public BucketAssignFunction(Configuration conf) {
   public void open(Configuration parameters) throws Exception {
 super.open(parameters);
 HoodieWriteConfig writeConfig = 
StreamerUtil.getHoodieClientConfig(this.conf);
-HoodieFlinkEngineContext context =
-new HoodieFlinkEngineContext(
-new SerializableConfiguration(StreamerUtil.getHadoopConf()),
-new FlinkTaskContextSupplier(getRuntimeContext()));
-this.bucketAssigner = new BucketAssigner(
-context,
-writeConfig);
+this.context = new HoodieFlinkEngineContext(
+new SerializableConfiguration(StreamerUtil.getHadoopConf()),

Review comment:
   Can we use `this.hadoopConf`? `getHadoopConf()` seems to be called twice.
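   
   A possible shape of the suggested change (sketch; the second constructor argument is carried over from the removed lines above):
   
   ```
   this.context = new HoodieFlinkEngineContext(
       new SerializableConfiguration(this.hadoopConf),        // reuse the field instead of a second getHadoopConf() call
       new FlinkTaskContextSupplier(getRuntimeContext()));
   ```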

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
##
@@ -37,6 +38,29 @@
  */
 public class HoodieIndexUtils {
 
+  /**
+   * Fetches Pair of partition path and {@link