[GitHub] [hudi] hudi-bot commented on pull request #6739: [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator

2022-09-21 Thread GitBox


hudi-bot commented on PR #6739:
URL: https://github.com/apache/hudi/pull/6739#issuecomment-1254563996

   
   ## CI report:
   
   * 6756f0e59418c7de7a7ca0d47a3fd2ff0427f04a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11570)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6739: [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator

2022-09-21 Thread GitBox


hudi-bot commented on PR #6739:
URL: https://github.com/apache/hudi/pull/6739#issuecomment-1254560139

   
   ## CI report:
   
   * 6756f0e59418c7de7a7ca0d47a3fd2ff0427f04a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6738: [HUDI-4895] Object store based lock provider

2022-09-21 Thread GitBox


hudi-bot commented on PR #6738:
URL: https://github.com/apache/hudi/pull/6738#issuecomment-1254556886

   
   ## CI report:
   
   * c0c9616166bf46216cdaf9ff8d634770e325e472 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11567)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] alexeykudinkin opened a new pull request, #6739: [HUDI-4851] Fixing handling of `UTF8String` w/in `InSet` operator

2022-09-21 Thread GitBox


alexeykudinkin opened a new pull request, #6739:
URL: https://github.com/apache/hudi/pull/6739

   ### Change Logs
   
   This takes up the fix from https://github.com/apache/hudi/pull/6700 and adds a test for it.
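   
   For context, the class of bug being addressed: a set of `java.lang.String` 
literals never matches a Spark `UTF8String` probe, even when the text is equal. 
A minimal illustration (a hypothetical demo, not this PR's code):
   
   ```java
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.Set;
   
   import org.apache.spark.unsafe.types.UTF8String;
   
   class InSetTypeMismatchDemo {
     public static void main(String[] args) {
       // IN-list literals materialized as plain Java strings.
       Set<Object> inList = new HashSet<>(Arrays.asList("a", "b"));
   
       // At runtime Spark hands over a UTF8String; it never equals a
       // java.lang.String, so the membership test silently returns false.
       UTF8String probe = UTF8String.fromString("a");
       System.out.println(inList.contains(probe)); // prints: false
     }
   }
   ```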
   
   ### Impact
   
   **Risk level: None**
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6733: [HUDI-4880] Fix corrupted parquet file issue left over by cancelled compaction task

2022-09-21 Thread GitBox


hudi-bot commented on PR #6733:
URL: https://github.com/apache/hudi/pull/6733#issuecomment-1254523155

   
   ## CI report:
   
   * fa31786d3256e2d0a40ae3c1f874d8f32a45ce82 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11566)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6734: [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data

2022-09-21 Thread GitBox


hudi-bot commented on PR #6734:
URL: https://github.com/apache/hudi/pull/6734#issuecomment-1254520367

   
   ## CI report:
   
   * 3d9071b62050a2b72d2522098f2b3263ddf91e40 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11554)
 
   * 06c2dca18820ac062262e38deed409ed7d7b4d2b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11569)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code

2022-09-21 Thread GitBox


hudi-bot commented on PR #6737:
URL: https://github.com/apache/hudi/pull/6737#issuecomment-1254516648

   
   ## CI report:
   
   * 5e745fc3455ec2ebdf06f1d3068d9c7a112e4987 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11565)
 
   * 63aaa03dbd85111385ce1cbf09ab5bc173a44c0f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11568)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6734: [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data

2022-09-21 Thread GitBox


hudi-bot commented on PR #6734:
URL: https://github.com/apache/hudi/pull/6734#issuecomment-1254516618

   
   ## CI report:
   
   * 3d9071b62050a2b72d2522098f2b3263ddf91e40 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11554)
 
   * 06c2dca18820ac062262e38deed409ed7d7b4d2b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code

2022-09-21 Thread GitBox


hudi-bot commented on PR #6737:
URL: https://github.com/apache/hudi/pull/6737#issuecomment-125451

   
   ## CI report:
   
   * 5e745fc3455ec2ebdf06f1d3068d9c7a112e4987 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11565)
 
   * 63aaa03dbd85111385ce1cbf09ab5bc173a44c0f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] eshu commented on issue #6283: [SUPPORT] No .marker files

2022-09-21 Thread GitBox


eshu commented on issue #6283:
URL: https://github.com/apache/hudi/issues/6283#issuecomment-1254494003

   @nsivabalan The workaround is working, but the bug still exists. If the workaround 
counts as a resolution, then yes, it is resolved.





[GitHub] [hudi] hudi-bot commented on pull request #6738: [HUDI-4895] Object store based lock provider

2022-09-21 Thread GitBox


hudi-bot commented on PR #6738:
URL: https://github.com/apache/hudi/pull/6738#issuecomment-1254476299

   
   ## CI report:
   
   * c0c9616166bf46216cdaf9ff8d634770e325e472 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11567)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6733: [HUDI-4880] Fix corrupted parquet file issue left over by cancelled compaction task

2022-09-21 Thread GitBox


hudi-bot commented on PR #6733:
URL: https://github.com/apache/hudi/pull/6733#issuecomment-1254476263

   
   ## CI report:
   
   * c7c9984860b14b40d3f716f1fc1f16dc70f548b4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11551)
 
   * fa31786d3256e2d0a40ae3c1f874d8f32a45ce82 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11566)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] IsisPolei commented on issue #6720: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieRemoteException: Connect to 192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (C

2022-09-21 Thread GitBox


IsisPolei commented on issue #6720:
URL: https://github.com/apache/hudi/issues/6720#issuecomment-1254475342

   The original problem is offline compaction: HoodieJavaWriteClient doesn't 
support inline compaction.
   
   @Override
   protected List<WriteStatus> compact(String compactionInstantTime,
                                       boolean shouldComplete) {
     throw new HoodieNotSupportedException("Compact is not supported in HoodieJavaClient");
   }
   
   So I changed my Hudi client to SparkRDDWriteClient. This client works a treat 
when using Spark local mode and standalone mode (on the same host machine).





[GitHub] [hudi] hudi-bot commented on pull request #6738: [HUDI-4895] Object store based lock provider

2022-09-21 Thread GitBox


hudi-bot commented on PR #6738:
URL: https://github.com/apache/hudi/pull/6738#issuecomment-1254473579

   
   ## CI report:
   
   * c0c9616166bf46216cdaf9ff8d634770e325e472 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6733: [HUDI-4880] Fix corrupted parquet file issue left over by cancelled compaction task

2022-09-21 Thread GitBox


hudi-bot commented on PR #6733:
URL: https://github.com/apache/hudi/pull/6733#issuecomment-1254473535

   
   ## CI report:
   
   * c7c9984860b14b40d3f716f1fc1f16dc70f548b4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11551)
 
   * fa31786d3256e2d0a40ae3c1f874d8f32a45ce82 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code

2022-09-21 Thread GitBox


hudi-bot commented on PR #6737:
URL: https://github.com/apache/hudi/pull/6737#issuecomment-1254470224

   
   ## CI report:
   
   * 5e745fc3455ec2ebdf06f1d3068d9c7a112e4987 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11565)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6736: [HUDI-4894] Fix ClassCastException when using fixed type defining dec…

2022-09-21 Thread GitBox


hudi-bot commented on PR #6736:
URL: https://github.com/apache/hudi/pull/6736#issuecomment-1254470203

   
   ## CI report:
   
   * 255a6aef08b5f9ee25a556baa31d5c329bd8dcfc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11564)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full

2022-09-21 Thread GitBox


hudi-bot commented on PR #6284:
URL: https://github.com/apache/hudi/pull/6284#issuecomment-1254469770

   
   ## CI report:
   
   * 026dbfc7a6d4d7e489e8c8671a84e143bdb01758 UNKNOWN
   * 4b0a4e72766491e15dbeb8ed904c9aabae32bb89 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11563)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[hudi] branch release-feature-rfc46 updated: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback (#6132)

2022-09-21 Thread yuzhaojing
This is an automated email from the ASF dual-hosted git repository.

yuzhaojing pushed a commit to branch release-feature-rfc46
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/release-feature-rfc46 by this 
push:
 new 41392e119f [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments 
feedback (#6132)
41392e119f is described below

commit 41392e119fcc7c7433d415d70b3800bc3dbf0e2b
Author: komao 
AuthorDate: Thu Sep 22 11:17:54 2022 +0800

[RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback (#6132)

* Update the RFC-46 doc to fix comments feedback

* fix

Co-authored-by: wangzixuan.wzxuan 
---
 rfc/rfc-46/rfc-46.md | 169 ---
 1 file changed, 134 insertions(+), 35 deletions(-)

diff --git a/rfc/rfc-46/rfc-46.md b/rfc/rfc-46/rfc-46.md
index a851a4443a..192bdbf8c6 100644
--- a/rfc/rfc-46/rfc-46.md
+++ b/rfc/rfc-46/rfc-46.md
@@ -38,7 +38,7 @@ when dealing with records (during merge, column value 
extractions, writing into
 
 While having a single format of the record representation is certainly making 
implementation of some components simpler, 
 it bears unavoidable performance penalty of de-/serialization loop: every 
record handled by Hudi has to be converted
-from (low-level) engine-specific representation (`Row` for Spark, `RowData` 
for Flink, `ArrayWritable` for Hive) into intermediate 
+from (low-level) engine-specific representation (`InternalRow` for Spark, 
`RowData` for Flink, `ArrayWritable` for Hive) into intermediate 
 one (Avro), with some operations (like clustering, compaction) potentially 
incurring this penalty multiple times (on read- 
 and write-paths). 
 
@@ -84,59 +84,105 @@ is known to have poor performance (compared to 
non-reflection based instantiatio
 
  Record Merge API
 
-Stateless component interface providing for API Combining Records will look 
like following:
+CombineAndGetUpdateValue and Precombine will converge to one API. Stateless 
component interface providing for API Combining Records will look like 
following:
 
 ```java
-interface HoodieMerge {
-   HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
-
-   Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, 
HoodieRecord newer, Schema schema, Properties props) throws IOException;
-}
+interface HoodieRecordMerger {
 
/**
-* Spark-specific implementation 
+* The kind of merging strategy this recordMerger belongs to. A UUID 
represents merging strategy.
 */
-   class HoodieSparkRecordMerge implements HoodieMerge {
+   String getMergingStrategy();
+  
+   // This method converges combineAndGetUpdateValue and precombine from 
HoodiePayload. 
+   // It'd be associative operation: f(a, f(b, c)) = f(f(a, b), c) (which we 
can translate as having 3 versions A, B, C of the single record, both orders of 
operations applications have to yield the same result)
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema 
schema, Properties props) throws IOException;
+   
+   // The record type handled by the current merger
+   // SPARK, AVRO, FLINK
+   HoodieRecordType getRecordType();
+}
 
-  @Override
-  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-// HoodieSparkRecords preCombine
-  }
+/**
+ * Spark-specific implementation 
+ */
+class HoodieSparkRecordMerger implements HoodieRecordMerger {
+
+  @Override
+  public String getMergingStrategy() {
+return UUID_MERGER_STRATEGY;
+  }
+  
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema 
schema, Properties props) throws IOException {
+ // HoodieSparkRecord precombine and combineAndGetUpdateValue. It'd be 
associative operation.
+   }
 
-  @Override
-  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, 
HoodieRecord newer, Schema schema, Properties props) {
- // HoodieSparkRecord combineAndGetUpdateValue
-  }
+   @Override
+   HoodieRecordType getRecordType() {
+ return HoodieRecordType.SPARK;
}
+}

-   /**
-* Flink-specific implementation 
-*/
-   class HoodieFlinkRecordMerge implements HoodieMerge {
-
-  @Override
-  public HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer) {
-// HoodieFlinkRecord preCombine
-  }
+/**
+ * Flink-specific implementation 
+ */
+class HoodieFlinkRecordMerger implements HoodieRecordMerger {
+
+   @Override
+   public String getMergingStrategy() {
+  return UUID_MERGER_STRATEGY;
+   }
+  
+   @Override
+   Option<HoodieRecord> merge(HoodieRecord older, HoodieRecord newer, Schema 
schema, Properties props) throws IOException {
+  // HoodieFlinkRecord precombine and combineAndGetUpdateValue. It'd be 
associative operation.
+   }
 
-  @Override
-  public Option<HoodieRecord> combineAndGetUpdateValue(HoodieRecord older, 
HoodieRecord newer, Schema schema, Properties props) {
- // HoodieFlinkRecord 
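
The commit email truncates here in the archive. The key contract the new 
`HoodieRecordMerger` introduces is that `merge` must be associative. A minimal 
sketch (illustrative only, not the RFC's implementation) of a merge rule with 
that property:

```java
import java.util.Comparator;

final class MaxByOrderingMerger {
  // "Keep the record with the larger ordering value, preferring the newer one
  // on ties" is associative: for versions A, B, C, both merge(A, merge(B, C))
  // and merge(merge(A, B), C) select the same winner.
  static <R> R merge(R older, R newer, Comparator<R> byOrderingValue) {
    return byOrderingValue.compare(newer, older) >= 0 ? newer : older;
  }
}
```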

[GitHub] [hudi] yuzhaojing merged pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

2022-09-21 Thread GitBox


yuzhaojing merged PR #6132:
URL: https://github.com/apache/hudi/pull/6132





[GitHub] [hudi] yuzhaojing merged pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-21 Thread GitBox


yuzhaojing merged PR #5629:
URL: https://github.com/apache/hudi/pull/5629





[GitHub] [hudi] IsisPolei commented on issue #6720: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieRemoteException: Connect to 192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (C

2022-09-21 Thread GitBox


IsisPolei commented on issue #6720:
URL: https://github.com/apache/hudi/issues/6720#issuecomment-125445

   I think the main reason for this problem is that my app (where 
SparkRDDWriteClient processes the Hudi data) and the Spark cluster that 
SparkRDDWriteClient connects to are deployed on different local machines. When 
both Docker containers run on the same host machine, everything works well, 
since the containers can connect to each other over the Docker bridge network 
(the Hudi Docker demo is also one of these scenarios). So I'm trying to find 
out how exactly Hudi and Spark connect to each other during this process. First 
I thought that if the HoodieSparkEngineContext initialized successfully, the 
connection part was done. Apparently there is something more: for example, the 
timeline server and remoteFileSystemView also need to be reachable, because the 
application will be running on the Spark worker nodes. 
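   
   A minimal sketch of one knob to check in such multi-host setups (assuming the 
standard HoodieWriteConfig builder API; a diagnostic aid, not a confirmed fix):
   
   ```java
   import org.apache.hudi.config.HoodieWriteConfig;
   
   class TimelineServerWorkaround {
     static HoodieWriteConfig build(String basePath) {
       // If executors cannot reach the driver's embedded timeline server,
       // disabling it makes tasks list the filesystem directly (slower, but
       // it removes the worker-to-driver network dependency).
       return HoodieWriteConfig.newBuilder()
           .withPath(basePath)
           .withEmbeddedTimelineServerEnabled(false)
           .build();
     }
   }
   ```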





[jira] [Updated] (HUDI-4895) Object Store based lock provider

2022-09-21 Thread Yuwei Xiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuwei Xiao updated HUDI-4895:
-
Component/s: multi-writer

> Object Store based lock provider
> 
>
> Key: HUDI-4895
> URL: https://issues.apache.org/jira/browse/HUDI-4895
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: multi-writer
>Reporter: Yuwei Xiao
>Assignee: Yuwei Xiao
>Priority: Major
>  Labels: pull-request-available
>
> Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic 
> guarantees of the underlying file system. Specifically, the LockProvider can 
> work properly only with the filesystem's atomic rename & atomic create 
> capabilities.
>  
> However, many Hudi users use object stores (e.g., S3, OSS), so we want to 
> implement an object store based lock provider.





[jira] [Updated] (HUDI-4812) Lazy partition listing and file groups fetching in Spark Query

2022-09-21 Thread Yuwei Xiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuwei Xiao updated HUDI-4812:
-
Component/s: spark

> Lazy partition listing and file groups fetching in Spark Query
> --
>
> Key: HUDI-4812
> URL: https://issues.apache.org/jira/browse/HUDI-4812
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Yuwei Xiao
>Assignee: Yuwei Xiao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> In the current Spark query implementation, the FileIndex will refresh and load 
> all file groups into cache in order to serve subsequent queries.
>  
> For a large table with many partitions, this may introduce significant 
> initialization overhead. Meanwhile, the query itself may come with partition 
> filters, so loading all file groups may be unnecessary.
>  
> So to optimize, the whole refresh logic will become lazy, where the actual work 
> will be carried out only after the partition filter is applied.
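
A minimal sketch of the lazy pattern described above (illustrative, not Hudi's 
actual FileIndex; `listPartition` is a hypothetical loader):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class LazyFileIndex<G> {
  private final Map<String, List<G>> cache = new ConcurrentHashMap<>();
  private final Function<String, List<G>> listPartition; // hypothetical loader

  LazyFileIndex(Function<String, List<G>> listPartition) {
    this.listPartition = listPartition;
  }

  // Nothing is listed at refresh time; each partition is loaded at most once,
  // and only if a query's partition filter actually selects it.
  List<G> fileGroups(String partition) {
    return cache.computeIfAbsent(partition, listPartition);
  }
}
```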





[jira] [Created] (HUDI-4896) Consistent hashing index resizing for Flink Engine

2022-09-21 Thread Yuwei Xiao (Jira)
Yuwei Xiao created HUDI-4896:


 Summary: Consistent hashing index resizing for Flink Engine
 Key: HUDI-4896
 URL: https://issues.apache.org/jira/browse/HUDI-4896
 Project: Apache Hudi
  Issue Type: Improvement
  Components: clustering, index
Reporter: Yuwei Xiao








[jira] [Updated] (HUDI-4895) Object Store based lock provider

2022-09-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4895:
-
Labels: pull-request-available  (was: )

> Object Store based lock provider
> 
>
> Key: HUDI-4895
> URL: https://issues.apache.org/jira/browse/HUDI-4895
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yuwei Xiao
>Assignee: Yuwei Xiao
>Priority: Major
>  Labels: pull-request-available
>
> Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic 
> guarantees of the underlying file system. Specifically, the LockProvider can 
> work properly only with the filesystem's atomic rename & atomic create 
> capabilities.
>  
> However, many Hudi users use object stores (e.g., S3, OSS), so we want to 
> implement an object store based lock provider.





[GitHub] [hudi] YuweiXiao opened a new pull request, #6738: [HUDI-4895] Object store based lock provider

2022-09-21 Thread GitBox


YuweiXiao opened a new pull request, #6738:
URL: https://github.com/apache/hudi/pull/6738

   ### Change Logs
   
   Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic 
guarantees of the underlying file system. Specifically, the LockProvider can 
work properly only with the filesystem's atomic rename & atomic create 
capabilities.
   
   This PR enables an object store (e.g., AliyunOSS) to serve as a lock provider.
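   
   A minimal sketch of the idea (illustrative only, not this PR's code; 
`putIfAbsent` stands in for whatever conditional-write primitive the object 
store exposes, e.g. an If-None-Match style precondition):
   
   ```java
   import java.nio.charset.StandardCharsets;
   
   // Hypothetical conditional-write primitive over an object store.
   interface ConditionalObjectStore {
     boolean putIfAbsent(String key, byte[] payload); // atomic create-if-absent
     void delete(String key);
   }
   
   final class ObjectStoreLock {
     private final ConditionalObjectStore store;
     private final String lockKey;
   
     ObjectStoreLock(ConditionalObjectStore store, String lockKey) {
       this.store = store;
       this.lockKey = lockKey;
     }
   
     // Exactly one concurrent caller can win the atomic create.
     boolean tryLock(String ownerId) {
       return store.putIfAbsent(lockKey, ownerId.getBytes(StandardCharsets.UTF_8));
     }
   
     void unlock() {
       store.delete(lockKey);
     }
   }
   ```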
   
   ### Impact
   
   No API change.
   
   **Risk level: none | low | medium | high**
   
   LOW.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Assigned] (HUDI-4895) Object Store based lock provider

2022-09-21 Thread Yuwei Xiao (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuwei Xiao reassigned HUDI-4895:


Assignee: Yuwei Xiao

> Object Store based lock provider
> 
>
> Key: HUDI-4895
> URL: https://issues.apache.org/jira/browse/HUDI-4895
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yuwei Xiao
>Assignee: Yuwei Xiao
>Priority: Major
>
> Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic 
> guarantees of the underlying file system. Specifically, the LockProvider can 
> work properly only with the filesystem's atomic rename & atomic create 
> capabilities.
>  
> However, many Hudi users use object stores (e.g., S3, OSS), so we want to 
> implement an object store based lock provider.





[GitHub] [hudi] loukey-lj commented on pull request #6704: [HUDI-4780] improve test setup

2022-09-21 Thread GitBox


loukey-lj commented on PR #6704:
URL: https://github.com/apache/hudi/pull/6704#issuecomment-1254445377

   > 
   @xushiyan thank you, this PR supplements the 
[6602](https://github.com/apache/hudi/pull/6602) test case. You can first look 
at the review record of [6602](https://github.com/apache/hudi/pull/6602).





[GitHub] [hudi] xicm commented on a diff in pull request #6715: [HUDI-3983] ClassNotFoundException when using hudi-spark-bundle to write table with hbase index

2022-09-21 Thread GitBox


xicm commented on code in PR #6715:
URL: https://github.com/apache/hudi/pull/6715#discussion_r977131538


##
hudi-common/src/main/resources/hbase-site.xml:
##
@@ -1699,13 +1699,6 @@ possible configurations would overwhelm and obscure the 
important.
   Implementation of the status publication with a multicast message.
 
   
-  
-hbase.status.listener.class
-
org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener

Review Comment:
   I mean we use the shaded name only with the bundle jar. If the dependency we 
use is hudi-common and the listener class comes from the original hbase-client, 
in that case we will get an exception.






[GitHub] [hudi] hudi-bot commented on pull request #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code

2022-09-21 Thread GitBox


hudi-bot commented on PR #6737:
URL: https://github.com/apache/hudi/pull/6737#issuecomment-1254433425

   
   ## CI report:
   
   * 5e745fc3455ec2ebdf06f1d3068d9c7a112e4987 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Updated] (HUDI-4373) Consistent bucket index write path for Flink engine

2022-09-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4373:
-
Labels: pull-request-available  (was: )

> Consistent bucket index write path for Flink engine
> ---
>
> Key: HUDI-4373
> URL: https://issues.apache.org/jira/browse/HUDI-4373
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink, index
>Reporter: Yuwei Xiao
>Assignee: Yuwei Xiao
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> A simple bucket index (with a fixed bucket number) is ready for the Flink engine 
> and has been used widely in the community. 
> Since Spark now supports consistent bucketing (a dynamic bucket number), we 
> should bridge the gap and bring this feature to Flink too.





[GitHub] [hudi] YuweiXiao opened a new pull request, #6737: [HUDI-4373] Flink Consistent hashing bucket index write path code

2022-09-21 Thread GitBox


YuweiXiao opened a new pull request, #6737:
URL: https://github.com/apache/hudi/pull/6737

   ### Change Logs
   
   Implement a consistent hashing bucket index for Flink. This PR only covers the 
write core of the index; the resizing implementation will be in another PR.
   
   There are three main changes:
   - Extract the common code of the consistent hashing bucket index, to serve 
both the Spark and Flink engines (see the lookup sketch after this list).
   - Have the Flink engine write path adapt to the consistent hashing bucket 
index, e.g., introduce `ConsistentBucketStreamWriteOperator`
   - Introduce the basic framework of `UpdateStrategy` for Flink, to handle 
conflicts between concurrent clustering & updates.
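   
   For reference, the core routing step a consistent hashing index performs (a 
generic sketch, not this PR's implementation): hash the record key onto a ring 
and route to the first node clockwise, so resizing only remaps keys adjacent to 
the changed node.
   
   ```java
   import java.util.Map;
   import java.util.TreeMap;
   
   final class ConsistentHashRing {
     // node hash -> bucket id; TreeMap keeps the ring sorted by hash value.
     private final TreeMap<Integer, Integer> ring = new TreeMap<>();
   
     void addNode(int nodeHash, int bucketId) {
       ring.put(nodeHash, bucketId);
     }
   
     // Route a key to the first node at or after its hash, wrapping around.
     // Assumes at least one node has been added.
     int bucketFor(int keyHash) {
       Map.Entry<Integer, Integer> e = ring.ceilingEntry(keyHash);
       return (e != null ? e : ring.firstEntry()).getValue();
     }
   }
   ```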
   
   ### Impact
   
   No public API change.
   
   **Risk level: none | low | medium | high**
   
   Low
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x]  Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6736: [HUDI-4894] Fix ClassCastException when using fixed type defining dec…

2022-09-21 Thread GitBox


hudi-bot commented on PR #6736:
URL: https://github.com/apache/hudi/pull/6736#issuecomment-1254427155

   
   ## CI report:
   
   * 255a6aef08b5f9ee25a556baa31d5c329bd8dcfc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11564)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6736: [HUDI-4894] Fix ClassCastException when using fixed type defining dec…

2022-09-21 Thread GitBox


hudi-bot commented on PR #6736:
URL: https://github.com/apache/hudi/pull/6736#issuecomment-1254423537

   
   ## CI report:
   
   * 255a6aef08b5f9ee25a556baa31d5c329bd8dcfc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6735: [HUDI-4892] Fix hudi-spark3-bundle

2022-09-21 Thread GitBox


hudi-bot commented on PR #6735:
URL: https://github.com/apache/hudi/pull/6735#issuecomment-1254419744

   
   ## CI report:
   
   * 51c0c21c9f5a689943147a1faded74c67fef61a2 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11562)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] xicm commented on a diff in pull request #6715: [HUDI-3983] ClassNotFoundException when using hudi-spark-bundle to write table with hbase index

2022-09-21 Thread GitBox


xicm commented on code in PR #6715:
URL: https://github.com/apache/hudi/pull/6715#discussion_r977117915


##
hudi-common/src/main/resources/hbase-site.xml:
##
@@ -1699,13 +1699,6 @@ possible configurations would overwhelm and obscure the 
important.
   Implementation of the status publication with a multicast message.
 
   
-  
-hbase.status.listener.class
-
org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener

Review Comment:
   @yihua If we rename the class with the shaded name, there will be a 
ClassNotFoundException when referencing hudi-common.
   
   






[jira] [Created] (HUDI-4895) Object Store based lock provider

2022-09-21 Thread Yuwei Xiao (Jira)
Yuwei Xiao created HUDI-4895:


 Summary: Object Store based lock provider
 Key: HUDI-4895
 URL: https://issues.apache.org/jira/browse/HUDI-4895
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yuwei Xiao


Currently, we have `FileSystemBasedLockProvider`, which relies on the atomic 
guarantees of the underlying file system. Specifically, the LockProvider can 
work properly only with the filesystem's atomic rename & atomic create 
capabilities.

However, many Hudi users use object stores (e.g., S3, OSS), so we want to 
implement an object store based lock provider.





[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column

2022-09-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4894:
-
Labels: pull-request-available  (was: )

> Fix ClassCastException when using fixed type defining decimal column
> 
>
> Key: HUDI-4894
> URL: https://issues.apache.org/jira/browse/HUDI-4894
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> schema for decimal column :
> {code:java}
> {
>     "name": "column_name",
>     "type": ["null", {
>         "type": "fixed",
>         "name": "fixed",
>         "size": 5,
>         "logicalType": "decimal",
>         "precision": 10,
>         "scale": 2
>     }],
>     "default": null
> }{code}
>  
> exception:
> Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
> java.util.List
> at 
> org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
> at 
> org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
> at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs
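
For reference, the conversion this code path needs (a hedged sketch using 
Avro's standard `Conversions.DecimalConversion`; not necessarily the patch's 
exact fix): the JSON number has to become a `BigDecimal` encoded into the fixed 
type, rather than being treated as a list of bytes:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

import org.apache.avro.Conversions;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericFixed;

final class JsonDoubleToFixedDecimal {
  static GenericFixed convert(double jsonValue, Schema fixedSchema) {
    // Scale the incoming number to the schema's declared scale (2 here),
    // then let Avro pack the unscaled value into the 5-byte fixed buffer.
    BigDecimal dec = BigDecimal.valueOf(jsonValue).setScale(2, RoundingMode.HALF_UP);
    return new Conversions.DecimalConversion()
        .toFixed(dec, fixedSchema, LogicalTypes.decimal(10, 2));
  }
}
```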





[GitHub] [hudi] wangxianghu opened a new pull request, #6736: [HUDI-4894] Fix ClassCastException when using fixed type defining dec…

2022-09-21 Thread GitBox


wangxianghu opened a new pull request, #6736:
URL: https://github.com/apache/hudi/pull/6736

   …imal column
   
   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column

2022-09-21 Thread Xianghu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghu Wang updated HUDI-4894:
---
Description: 
schema for decimal column :
{code:java}
{
    "name": "column_name",
    "type": ["null", {
        "type": "fixed",
        "name": "fixed",
        "size": 5,
        "logicalType": "decimal",
        "precision": 10,
        "scale": 2
    }],
    "default": null
}{code}
 

exception:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
java.util.List
at 
org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
at 
org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs

  was:
schema for decimal column :

{
    "name": "column_name",
    "type": ["null",{

    "type": "fixed",        

    "name": "fixed",        

    "size": 5,        

    "logicalType": "decimal",        

    "precision": 10,        

    "scale": 2     }],
    "default": null
}

 

exception:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
java.util.List
at 
org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
at 
org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs


> Fix ClassCastException when using fixed type defining decimal column
> 
>
> Key: HUDI-4894
> URL: https://issues.apache.org/jira/browse/HUDI-4894
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
> Fix For: 0.12.1
>
>
> schema for decimal column :
> {code:java}
> {
>     "name": "column_name",
>     "type": ["null", {
>         "type": "fixed",
>         "name": "fixed",
>         "size": 5,
>         "logicalType": "decimal",
>         "precision": 10,
>         "scale": 2
>     }],
>     "default": null
> }{code}
>  
> exception:
> Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
> java.util.List
> at 
> org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
> at 
> org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
> at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs





[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column

2022-09-21 Thread Xianghu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghu Wang updated HUDI-4894:
---
Description: 
schema for decimal column :

{
    "name": "column_name",
    "type": ["null",{

    "type": "fixed",        

    "name": "fixed",        

    "size": 5,        

    "logicalType": "decimal",        

    "precision": 10,        

    "scale": 2     }],
    "default": null
}

 

exception:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
java.util.List
at 
org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
at 
org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs

  was:
schema for decimal column :

{
    "name": "column_name",
    "type": ["null", {
        "type": "fixed",
        "name": "fixed",
        "size": 5,
        "logicalType": "decimal",
        "precision": 10,
        "scale": 2
    }],
    "default": null
}

 

exception:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
java.util.List
at 
org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
at 
org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs


> Fix ClassCastException when using fixed type defining decimal column
> 
>
> Key: HUDI-4894
> URL: https://issues.apache.org/jira/browse/HUDI-4894
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
> Fix For: 0.12.1
>
>
> schema for decimal column :
> {
>     "name": "column_name",
>     "type": ["null",{
>     "type": "fixed",        
>     "name": "fixed",        
>     "size": 5,        
>     "logicalType": "decimal",        
>     "precision": 10,        
>     "scale": 2     }],
>     "default": null
> }
>  
> exception:
> Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
> java.util.List
> at 
> org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
> at 
> org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
> at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs





[jira] [Updated] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column

2022-09-21 Thread Xianghu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghu Wang updated HUDI-4894:
---
Description: 
schema for decimal column :

{
    "name": "column_name",
    "type": ["null", {
        "type": "fixed",
        "name": "fixed",
        "size": 5,
        "logicalType": "decimal",
        "precision": 10,
        "scale": 2
    }],
    "default": null
}

 

exception:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
java.util.List
at 
org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
at 
org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs

  was:
schema for decimal column :

{
        "name": "decimal_column_name",
        "type": ["null", {
            "type": "fixed",
            "name": "fixed",
            "size": 5,
            "logicalType": "decimal",
            "precision": 10,
            "scale": 2
        }],
        "default": null
    }

 

exception:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
java.util.List
at 
org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
at 
org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs


> Fix ClassCastException when using fixed type defining decimal column
> 
>
> Key: HUDI-4894
> URL: https://issues.apache.org/jira/browse/HUDI-4894
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: core
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
> Fix For: 0.12.1
>
>
> schema for decimal column :
> {
>     "name": "column_name",
>     "type": ["null", {
>         "type": "fixed",
>         "name": "fixed",
>         "size": 5,
>         "logicalType": "decimal",
>         "precision": 10,
>         "scale": 2
>     }],
>     "default": null
> }
>  
> exception:
> Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
> java.util.List
> at 
> org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
> at 
> org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
> at 
> org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
> at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs





[jira] [Created] (HUDI-4894) Fix ClassCastException when using fixed type defining decimal column

2022-09-21 Thread Xianghu Wang (Jira)
Xianghu Wang created HUDI-4894:
--

 Summary: Fix ClassCastException when using fixed type defining 
decimal column
 Key: HUDI-4894
 URL: https://issues.apache.org/jira/browse/HUDI-4894
 Project: Apache Hudi
  Issue Type: Bug
  Components: core
Reporter: Xianghu Wang
Assignee: Xianghu Wang
 Fix For: 0.12.1


schema for decimal column :

{
        "name": "decimal_column_name",
        "type": ["null", {
            "type": "fixed",
            "name": "fixed",
            "size": 5,
            "logicalType": "decimal",
            "precision": 10,
            "scale": 2
        }],
        "default": null
    }

 

exception:

Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
java.util.List
at 
org.apache.hudi.avro.MercifulJsonConverter$9.convert(MercifulJsonConverter.java:254)
at 
org.apache.hudi.avro.MercifulJsonConverter$JsonToAvroFieldProcessor.convertToAvro(MercifulJsonConverter.java:151)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvroField(MercifulJsonConverter.java:140)
at 
org.apache.hudi.avro.MercifulJsonConverter.convertJsonToAvro(MercifulJsonConverter.java:107)
at 
org.apache.hudi.avro.MercifulJsonConverter.convert(MercifulJsonConverter.java:96)
at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJs





[GitHub] [hudi] hudi-bot commented on pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full

2022-09-21 Thread GitBox


hudi-bot commented on PR #6284:
URL: https://github.com/apache/hudi/pull/6284#issuecomment-1254378397

   
   ## CI report:
   
   * 026dbfc7a6d4d7e489e8c8671a84e143bdb01758 UNKNOWN
   * 0ea0766862c16ccec08c7c621f98ca8402f772ff Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10571)
 
   * 4b0a4e72766491e15dbeb8ed904c9aabae32bb89 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11563)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #6697: [HUDI-3478] Implement CDC Write in Spark

2022-09-21 Thread GitBox


danny0405 commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r977092086


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java:
##
@@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation;
+import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode;
+import org.apache.hudi.common.table.cdc.HoodieCDCUtils;
+import org.apache.hudi.common.table.log.AppendResult;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieUpsertException;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * This class encapsulates all the cdc-writing functions.
+ */
+public class HoodieCDCLogger implements Closeable {
+
+  private final String commitTime;
+
+  private final String keyField;
+
+  private final Schema dataSchema;
+
+  private final boolean populateMetaFields;
+
+  // writer for cdc data
+  private final HoodieLogFormat.Writer cdcWriter;
+
+  private final boolean cdcEnabled;
+
+  private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode;
+
+  private final Schema cdcSchema;
+
+  private final String cdcSchemaString;
+
+  // the cdc data
+  private final Map<String, HoodieAvroPayload> cdcData;
+
+  public HoodieCDCLogger(
+  String commitTime,
+  HoodieWriteConfig config,
+  HoodieTableConfig tableConfig,
+  Schema schema,
+  HoodieLogFormat.Writer cdcWriter,
+  long maxInMemorySizeInBytes) {
+try {
+  this.commitTime = commitTime;
+  this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema);
+  this.populateMetaFields = config.populateMetaFields();
+  this.keyField = populateMetaFields ? 
HoodieRecord.RECORD_KEY_METADATA_FIELD
+  : tableConfig.getRecordKeyFieldProp();
+  this.cdcWriter = cdcWriter;
+
+  this.cdcEnabled = 
config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED);
+  this.cdcSupplementalLoggingMode = HoodieCDCSupplementalLoggingMode.parse(

Review Comment:
   Is the `cdcEnabled` flag always true here? Because this is a CDC logger.
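   
   For illustration (hedged assumption: the `ValidationUtils.checkArgument` overload 
and the config key below are taken to exist in this codebase as named), the 
constructor could assert that invariant instead of re-reading the config:
   
   ```java
   // Hypothetical guard: a CDC logger only makes sense when CDC is enabled.
   ValidationUtils.checkArgument(
       config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED),
       "HoodieCDCLogger requires CDC to be enabled in the table config");
   ```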



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #5341: [HUDI-3919] [UBER] Support out of order rollback blocks in AbstractHoodieLogRecordReader

2022-09-21 Thread GitBox


nsivabalan commented on PR #5341:
URL: https://github.com/apache/hudi/pull/5341#issuecomment-1254376340

   Closing in favor of https://github.com/apache/hudi/pull/5958
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] suryaprasanna closed pull request #5341: [HUDI-3919] [UBER] Support out of order rollback blocks in AbstractHoodieLogRecordReader

2022-09-21 Thread GitBox


suryaprasanna closed pull request #5341: [HUDI-3919] [UBER] Support out of 
order rollback blocks in AbstractHoodieLogRecordReader
URL: https://github.com/apache/hudi/pull/5341


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full

2022-09-21 Thread GitBox


hudi-bot commented on PR #6284:
URL: https://github.com/apache/hudi/pull/6284#issuecomment-1254375965

   
   ## CI report:
   
   * 026dbfc7a6d4d7e489e8c8671a84e143bdb01758 UNKNOWN
   * 0ea0766862c16ccec08c7c621f98ca8402f772ff Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10571)
 
   * 4b0a4e72766491e15dbeb8ed904c9aabae32bb89 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data

2022-09-21 Thread GitBox


hudi-bot commented on PR #4015:
URL: https://github.com/apache/hudi/pull/4015#issuecomment-1254374818

   
   ## CI report:
   
   * e1cf530fbae41de33cb9cc76a16a2e6dc5425837 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11560)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #6697: [HUDI-3478] Implement CDC Write in Spark

2022-09-21 Thread GitBox


danny0405 commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r977089707


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java:
##
@@ -93,13 +94,18 @@ public void write(GenericRecord oldRecord) {
 throw new HoodieUpsertException("Insert/Update not in sorted order");
   }
   try {
+Option<IndexedRecord> insertRecord;
 if (useWriterSchemaForCompaction) {
-  writeRecord(hoodieRecord, 
hoodieRecord.getData().getInsertValue(tableSchemaWithMetaFields, 
config.getProps()));
+  insertRecord = 
hoodieRecord.getData().getInsertValue(tableSchemaWithMetaFields, 
config.getProps());
 } else {
-  writeRecord(hoodieRecord, 
hoodieRecord.getData().getInsertValue(tableSchema, config.getProps()));
+  insertRecord = hoodieRecord.getData().getInsertValue(tableSchema, 
config.getProps());
 }
+writeRecord(hoodieRecord, insertRecord);
 insertRecordsWritten++;
 writtenRecordKeys.add(keyToPreWrite);
+if (cdcEnabled) {
+  cdcLogger.put(hoodieRecord, null, insertRecord);
+}

Review Comment:
   If sorted merge deserves a sub-class, we should follow that and give the cdc 
feature a sub-class too; I would see it as proof that we should keep good 
extensibility for different use cases and components.
   
   >  it will quickly become unmanageable
   
   Why do you call it unmanageable if we only add 2 classes here? I didn't feel 
that, given that I have already added 5 handles for Flink. We can manage 
them because they are instantiated in a factory in the base commit executor.
   
   BTW, imagine what a mess the code would be if I put this Flink logic into the 
base handles. I do agree we need some refactoring to the base handles, but that 
should be very small.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4893:
--
Status: In Progress  (was: Open)

> More than 1 splits are created for a single log file for MOR table
> --
>
> Key: HUDI-4893
> URL: https://issues.apache.org/jira/browse/HUDI-4893
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.1
>
>
> While debugging a flaky test, I realized that we are generating more than one 
> split for a single log file. Root-caused it to isSplitable(), which returns 
> true for HoodieRealtimePath. 
>  
> [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91]
>  
> I made a quick fix locally and verified that only one split is generated per 
> log file. 
>  
> {code:java}
> git diff 
> hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> diff --git 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
>  
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> index bba44d5c66..d09dfdf753 100644
> --- 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> +++ 
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path {
>}
>  
>public boolean isSplitable() {
> -return !toString().isEmpty() && !includeBootstrapFilePath();
> +return !toString().contains(".log") && !includeBootstrapFilePath();
>}
>  
>public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4884) Fix website docs for default index type in hudi

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4884:
--
Reviewers: Ethan Guo

> Fix website docs for default index type in hudi
> ---
>
> Key: HUDI-4884
> URL: https://issues.apache.org/jira/browse/HUDI-4884
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> [https://hudi.apache.org/docs/faq#how-does-the-hudi-indexing-work--what-are-its-benefits]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4893:
--
Story Points: 2

> More than 1 splits are created for a single log file for MOR table
> --
>
> Key: HUDI-4893
> URL: https://issues.apache.org/jira/browse/HUDI-4893
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.1
>
>
> While debugging a flaky test, I realized that we are generating more than one 
> split for a single log file. Root-caused it to isSplitable(), which returns 
> true for HoodieRealtimePath. 
>  
> [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91]
>  
> I made a quick fix locally and verified that only one split is generated per 
> log file. 
>  
> {code:java}
> git diff 
> hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> diff --git 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
>  
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> index bba44d5c66..d09dfdf753 100644
> --- 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> +++ 
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path {
>}
>  
>public boolean isSplitable() {
> -return !toString().isEmpty() && !includeBootstrapFilePath();
> +return !toString().contains(".log") && !includeBootstrapFilePath();
>}
>  
>public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4848) Fix tooling for deprecated partition

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4848:
--
Reviewers: Raymond Xu

> Fix tooling for deprecated partition 
> -
>
> Key: HUDI-4848
> URL: https://issues.apache.org/jira/browse/HUDI-4848
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Hudi CLI has support to fix the deprecated partition, but it assumes "string" 
> datatype for the partitioning column. We might have to fix that assumption. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HUDI-4893) More than 1 splits are created for a single log file for MOR table

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-4893:
-

Assignee: sivabalan narayanan

> More than 1 splits are created for a single log file for MOR table
> --
>
> Key: HUDI-4893
> URL: https://issues.apache.org/jira/browse/HUDI-4893
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.1
>
>
> While debugging a flaky test, I realized that we are generating more than one 
> split for a single log file. Root-caused it to isSplitable(), which returns 
> true for HoodieRealtimePath. 
>  
> [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91]
>  
> I made a quick fix locally and verified that only one split is generated per 
> log file. 
>  
> {code:java}
> git diff 
> hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> diff --git 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
>  
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> index bba44d5c66..d09dfdf753 100644
> --- 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> +++ 
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path {
>}
>  
>public boolean isSplitable() {
> -return !toString().isEmpty() && !includeBootstrapFilePath();
> +return !toString().contains(".log") && !includeBootstrapFilePath();
>}
>  
>public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4893:
--
Sprint: 2022/09/19

> More than 1 splits are created for a single log file for MOR table
> --
>
> Key: HUDI-4893
> URL: https://issues.apache.org/jira/browse/HUDI-4893
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.1
>
>
> While debugging a flaky test, I realized that we are generating more than one 
> split for a single log file. Root-caused it to isSplitable(), which returns 
> true for HoodieRealtimePath. 
>  
> [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91]
>  
> I made a quick fix locally and verified that only one split is generated per 
> log file. 
>  
> {code:java}
> git diff 
> hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> diff --git 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
>  
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> index bba44d5c66..d09dfdf753 100644
> --- 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> +++ 
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path {
>}
>  
>public boolean isSplitable() {
> -return !toString().isEmpty() && !includeBootstrapFilePath();
> +return !toString().contains(".log") && !includeBootstrapFilePath();
>}
>  
>public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4893:
--
Fix Version/s: 0.12.1

> More than 1 splits are created for a single log file for MOR table
> --
>
> Key: HUDI-4893
> URL: https://issues.apache.org/jira/browse/HUDI-4893
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.12.1
>
>
> While debugging a flaky test, I realized that we are generating more than one 
> split for a single log file. Root-caused it to isSplitable(), which returns 
> true for HoodieRealtimePath. 
>  
> [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91]
>  
> I made a quick fix locally and verified that only one split is generated per 
> log file. 
>  
> {code:java}
> git diff 
> hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> diff --git 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
>  
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> index bba44d5c66..d09dfdf753 100644
> --- 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> +++ 
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path {
>}
>  
>public boolean isSplitable() {
> -return !toString().isEmpty() && !includeBootstrapFilePath();
> +return !toString().contains(".log") && !includeBootstrapFilePath();
>}
>  
>public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4893) More than 1 splits are created for a single log file for MOR table

2022-09-21 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-4893:
-

 Summary: More than 1 splits are created for a single log file for 
MOR table
 Key: HUDI-4893
 URL: https://issues.apache.org/jira/browse/HUDI-4893
 Project: Apache Hudi
  Issue Type: Bug
  Components: reader-core
Reporter: sivabalan narayanan


While debugging a flaky test, I realized that we are generating more than one split 
for a single log file. Root-caused it to isSplitable(), which returns true for 
HoodieRealtimePath. 

 

[https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91]

 

I made a quick fix locally and verified that only one split is generated per 
log file. 

 
{code:java}
git diff 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
index bba44d5c66..d09dfdf753 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
@@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path {
   }
 
   public boolean isSplitable() {
-return !toString().isEmpty() && !includeBootstrapFilePath();
+return !toString().contains(".log") && !includeBootstrapFilePath();
   }
 
   public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { {code}
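
For illustration, a hedged variant of the same check (the wrapper class below is 
hypothetical; it assumes FSUtils.isLogFile is reachable from hudi-hadoop-mr): 
matching the parsed log-file name instead of a raw ".log" substring avoids 
misclassifying paths that merely contain ".log" inside a directory name.

{code:java}
import org.apache.hadoop.fs.Path;

import org.apache.hudi.common.fs.FSUtils;

public class SplitableCheckSketch {
  // A path is splittable only if it is neither a log file nor backed by a
  // bootstrap base file; a log file must be read as a single split.
  public static boolean isSplitable(Path path, boolean includesBootstrapFilePath) {
    return !FSUtils.isLogFile(path) && !includesBootstrapFilePath;
  }
}
{code}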
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4893) More than 1 splits are created for a single log file for MOR table

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4893:
--
Priority: Blocker  (was: Major)

> More than 1 splits are created for a single log file for MOR table
> --
>
> Key: HUDI-4893
> URL: https://issues.apache.org/jira/browse/HUDI-4893
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Blocker
>
> While debugging a flaky test, I realized that we are generating more than one 
> split for a single log file. Root-caused it to isSplitable(), which returns 
> true for HoodieRealtimePath. 
>  
> [https://github.com/apache/hudi/blob/6dbe2960f2eaf0408dc0ef544991cad0190050a9/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java#L91]
>  
> I made a quick fix locally and verified that only one split is generated per 
> log file. 
>  
> {code:java}
> git diff 
> hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> diff --git 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
>  
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> index bba44d5c66..d09dfdf753 100644
> --- 
> a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> +++ 
> b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieRealtimePath.java
> @@ -89,7 +89,7 @@ public class HoodieRealtimePath extends Path {
>}
>  
>public boolean isSplitable() {
> -return !toString().isEmpty() && !includeBootstrapFilePath();
> +return !toString().contains(".log") && !includeBootstrapFilePath();
>}
>  
>public PathWithBootstrapFileStatus getPathWithBootstrapFileStatus() { 
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on pull request #6284: [HUDI-4526] Improve spillableMapBasePath disk directory is full

2022-09-21 Thread GitBox


nsivabalan commented on PR #6284:
URL: https://github.com/apache/hudi/pull/6284#issuecomment-1254371638

   @xushiyan: can you review this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #6697: [HUDI-3478] Implement CDC Write in Spark

2022-09-21 Thread GitBox


danny0405 commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r977087389


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -292,6 +315,9 @@ protected void writeInsertRecord(HoodieRecord hoodieRecord) throws IOException
   return;
 }
 if (writeRecord(hoodieRecord, insertRecord, HoodieOperation.isDelete(hoodieRecord.getOperation()))) {
+  if (cdcEnabled) {
+cdcLogger.put(hoodieRecord, null, insertRecord);
+  }

Review Comment:
   What do you mean by `deserialized twice`? Just overriding the `writeRecord` 
method and adding the cdc logger logic should work here.
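   
   A rough sketch of that suggestion (the signature and field names below are 
assumed from this diff's context, not verified against the actual base class):
   
   ```java
   // Hypothetical subclass hook: append to the CDC log only when the write succeeds.
   @Override
   protected boolean writeRecord(HoodieRecord hoodieRecord, Option<IndexedRecord> insertRecord, boolean isDelete) {
     boolean written = super.writeRecord(hoodieRecord, insertRecord, isDelete);
     if (written && cdcEnabled) {
       cdcLogger.put(hoodieRecord, null, insertRecord);
     }
     return written;
   }
   ```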



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4892) Fix hudi-spark3-bundle

2022-09-21 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4892:

Sprint: 2022/09/19

> Fix hudi-spark3-bundle
> --
>
> Key: HUDI-4892
> URL: https://issues.apache.org/jira/browse/HUDI-4892
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Using hudi-spark3-bundle with Spark 3.3 shell, the following exception is 
> thrown.  Some classes are not packaged into the bundle.
> {code:java}
> scala> val df = spark.read.format("hudi").load("")
> java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.hudi.Spark32PlusDefaultSource not found
>   at java.util.ServiceLoader.fail(ServiceLoader.java:239)
>   at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
>   at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
>   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>   at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
>   ... 47 elided {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4892) Fix hudi-spark3-bundle

2022-09-21 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4892:

Status: Patch Available  (was: In Progress)

> Fix hudi-spark3-bundle
> --
>
> Key: HUDI-4892
> URL: https://issues.apache.org/jira/browse/HUDI-4892
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Using hudi-spark3-bundle with Spark 3.3 shell, the following exception is 
> thrown.  Some classes are not packaged into the bundle.
> {code:java}
> scala> val df = spark.read.format("hudi").load("")
> java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.hudi.Spark32PlusDefaultSource not found
>   at java.util.ServiceLoader.fail(ServiceLoader.java:239)
>   at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
>   at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
>   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>   at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
>   ... 47 elided {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4892) Fix hudi-spark3-bundle

2022-09-21 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4892:

Status: In Progress  (was: Open)

> Fix hudi-spark3-bundle
> --
>
> Key: HUDI-4892
> URL: https://issues.apache.org/jira/browse/HUDI-4892
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Using hudi-spark3-bundle with Spark 3.3 shell, the following exception is 
> thrown.  Some classes are not packaged into the bundle.
> {code:java}
> scala> val df = spark.read.format("hudi").load("")
> java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.hudi.Spark32PlusDefaultSource not found
>   at java.util.ServiceLoader.fail(ServiceLoader.java:239)
>   at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
>   at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
>   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>   at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
>   ... 47 elided {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6735: [HUDI-4892] Fix hudi-spark3-bundle

2022-09-21 Thread GitBox


hudi-bot commented on PR #6735:
URL: https://github.com/apache/hudi/pull/6735#issuecomment-1254343526

   
   ## CI report:
   
   * 51c0c21c9f5a689943147a1faded74c67fef61a2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11562)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6498: [HUDI-4878] Fix incremental cleaner use case

2022-09-21 Thread GitBox


hudi-bot commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1254343298

   
   ## CI report:
   
   * 3c05d0af21cc79358b7c0ffb7aad579da19129db Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11561)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6735: [HUDI-4892] Fix hudi-spark3-bundle

2022-09-21 Thread GitBox


hudi-bot commented on PR #6735:
URL: https://github.com/apache/hudi/pull/6735#issuecomment-1254341123

   
   ## CI report:
   
   * 51c0c21c9f5a689943147a1faded74c67fef61a2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6498: [HUDI-4878] Fix incremental cleaner use case

2022-09-21 Thread GitBox


hudi-bot commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1254340889

   
   ## CI report:
   
   * 054e2a560ef080b3591d52f3b2d1cd8b3c2ab0f7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11169)
 
   * 3c05d0af21cc79358b7c0ffb7aad579da19129db UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4892) Fix hudi-spark3-bundle

2022-09-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4892:
-
Labels: pull-request-available  (was: )

> Fix hudi-spark3-bundle
> --
>
> Key: HUDI-4892
> URL: https://issues.apache.org/jira/browse/HUDI-4892
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Using hudi-spark3-bundle with Spark 3.3 shell, the following exception is 
> thrown.  Some classes are not packaged into the bundle.
> {code:java}
> scala> val df = spark.read.format("hudi").load("")
> java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.hudi.Spark32PlusDefaultSource not found
>   at java.util.ServiceLoader.fail(ServiceLoader.java:239)
>   at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
>   at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
>   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>   at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
>   ... 47 elided {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua opened a new pull request, #6735: [HUDI-4892] Fix hudi-spark3-bundle

2022-09-21 Thread GitBox


yihua opened a new pull request, #6735:
URL: https://github.com/apache/hudi/pull/6735

   ### Change Logs
   
   This PR fixes the hudi-spark3-bundle. Before this PR, reading a Hudi table 
with the Spark datasource in the Spark 3.3 shell using hudi-spark3-bundle throws 
the following exception, because some classes are not packaged into the spark3 
bundle.
   
   ```
   scala> val df = spark.read.format("hudi").load("")
   java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.hudi.Spark32PlusDefaultSource not found
 at java.util.ServiceLoader.fail(ServiceLoader.java:239)
 at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
 at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
 at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
 at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
 at 
scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
 at scala.collection.Iterator.foreach(Iterator.scala:943)
 at scala.collection.Iterator.foreach$(Iterator.scala:943)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
 at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
 at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
 at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
 at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
 at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
 at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
 at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
 ... 47 elided 
   ```
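   
   For context, a minimal sketch (illustrative only, not part of this PR) of the 
failure mode: Spark's datasource lookup iterates `ServiceLoader`, which reads 
`META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` from the 
bundle and then loads every class listed there, so a provider named in the 
service file but missing from the shaded jar fails exactly like this.
   
   ```java
   import java.util.ServiceLoader;
   
   import org.apache.spark.sql.sources.DataSourceRegister;
   
   public class ProviderScanSketch {
     public static void main(String[] args) {
       // Iterating forces each provider class named in the service file to load;
       // a listed-but-unpackaged class raises ServiceConfigurationError here.
       for (DataSourceRegister register : ServiceLoader.load(DataSourceRegister.class)) {
         System.out.println(register.shortName());
       }
     }
   }
   ```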
   
   ### Impact
   
   **Risk level: low**
   
   Fixing the hudi-spark3-bundle packaging only to avoid class not found.
   
   Tested locally and on EMR that the hudi-spark3-bundle works after the fix.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6734: [HUDI-3478][HUDI-4887] Use Avro as the format of persisted cdc data

2022-09-21 Thread GitBox


alexeykudinkin commented on code in PR #6734:
URL: https://github.com/apache/hudi/pull/6734#discussion_r977061910


##
hudi-common/src/main/java/org/apache/hudi/avro/AvroSchemaUtils.java:
##
@@ -109,6 +109,11 @@ public static Schema createNullableSchema(Schema.Type 
avroType) {
 return Schema.createUnion(Schema.create(Schema.Type.NULL), 
Schema.create(avroType));

Review Comment:
   Let's rebase this one onto new one you're adding



##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/cdc/TestCDCDataFrameSuite.scala:
##
@@ -0,0 +1,238 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional.cdc
+
+import org.apache.avro.Schema
+import org.apache.avro.generic.IndexedRecord
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.hudi.DataSourceWriteOptions._
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.model.{HoodieCommitMetadata, HoodieLogFile}
+import org.apache.hudi.common.table.cdc.{HoodieCDCSupplementalLoggingMode, 
HoodieCDCUtils}
+import org.apache.hudi.common.table.log.HoodieLogFormat
+import org.apache.hudi.common.table.log.block.{HoodieDataBlock, HoodieLogBlock}
+import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.table.timeline.HoodieInstant
+import 
org.apache.hudi.common.testutils.RawTripTestPayload.{deleteRecordsToStrings, 
recordsToStrings}
+import org.apache.hudi.config.{HoodieCleanConfig, HoodieWriteConfig}
+import org.apache.hudi.testutils.HoodieClientTestBase
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.SaveMode
+
+import org.junit.jupiter.api.{AfterEach, BeforeEach}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertTrue}
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.CsvSource
+
+import scala.collection.JavaConversions._
+import scala.collection.JavaConverters._
+
+class TestCDCDataFrameSuite extends HoodieClientTestBase {
+
+  var spark: SparkSession = _
+
+  val commonOpts = Map(
+HoodieTableConfig.CDC_ENABLED.key -> "true",
+"hoodie.insert.shuffle.parallelism" -> "4",
+"hoodie.upsert.shuffle.parallelism" -> "4",
+"hoodie.bulkinsert.shuffle.parallelism" -> "2",
+"hoodie.delete.shuffle.parallelism" -> "1",
+RECORDKEY_FIELD.key -> "_row_key",
+PRECOMBINE_FIELD.key -> "timestamp",
+HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
+HoodieMetadataConfig.COMPACT_NUM_DELTA_COMMITS.key -> "1",
+HoodieCleanConfig.AUTO_CLEAN.key -> "false"
+  )
+
+  @BeforeEach override def setUp(): Unit = {
+setTableName("hoodie_test")
+initPath()
+initSparkContexts()
+spark = sqlContext.sparkSession
+initTestDataGenerator()
+initFileSystem()
+  }
+
+  @AfterEach override def tearDown(): Unit = {
+cleanupSparkContexts()
+cleanupTestDataGenerator()
+cleanupFileSystem()
+  }
+
+  @ParameterizedTest
+  @CsvSource(Array("cdc_op_key", "cdc_data_before", "cdc_data_before_after"))
+  def testCOWDataSourceWrite(cdcSupplementalLoggingMode: String): Unit = {
+val options = commonOpts ++ Map(
+  HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE.key -> 
cdcSupplementalLoggingMode
+)
+
+// Insert Operation
+val records1 = recordsToStrings(dataGen.generateInserts("000", 100)).toList
+val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+inputDF1.write.format("org.apache.hudi")
+  .options(options)
+  .mode(SaveMode.Overwrite)
+  .save(basePath)
+
+// init meta client
+metaClient = HoodieTableMetaClient.builder()
+  .setBasePath(basePath)
+  .setConf(spark.sessionState.newHadoopConf)
+  .build()
+val instant1 = metaClient.reloadActiveTimeline.lastInstant().get()
+assertEquals(spark.read.format("hudi").load(basePath).count(), 100)
+// all the data is new-coming, it will not write out cdc log files.
+assertFalse(hasCDCLogFile(instant1))
+
+val schemaResolver = new TableSchemaResolver(metaClient)
+val dataSchema = schemaResolver.getTableAvroSchema(false)
+val cdcSchema = 

[jira] [Created] (HUDI-4892) Fix hudi-spark3-bundle

2022-09-21 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-4892:
---

 Summary: Fix hudi-spark3-bundle
 Key: HUDI-4892
 URL: https://issues.apache.org/jira/browse/HUDI-4892
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo
Assignee: Ethan Guo
 Fix For: 0.12.1


Using hudi-spark3-bundle with Spark 3.3 shell, the following exception is 
thrown.  Some classes are not packaged into the bundle.
{code:java}
scala> val df = spark.read.format("hudi").load("")
java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.hudi.Spark32PlusDefaultSource not found
  at java.util.ServiceLoader.fail(ServiceLoader.java:239)
  at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
  at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
  at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
  at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
  at scala.collection.Iterator.foreach(Iterator.scala:943)
  at scala.collection.Iterator.foreach$(Iterator.scala:943)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
  at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
  at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
  at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
  at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
  at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
  at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
  at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
  at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
  ... 47 elided {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on pull request #6498: [HUDI-4878] Fix incremental cleaner use case

2022-09-21 Thread GitBox


nsivabalan commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1254323578

   @codope: Can you review this patch? I have overhauled the initial fix that was 
put up; it could result in a good perf improvement for cleaning. I am yet to write 
tests, but do take a look at my logic and let me know if it looks OK, or whether 
there is any case that I could be missing. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] CTTY commented on a diff in pull request #5113: [HUDI-3625] [RFC-60] Optimized storage layout for Cloud Object Stores

2022-09-21 Thread GitBox


CTTY commented on code in PR #5113:
URL: https://github.com/apache/hudi/pull/5113#discussion_r977051856


##
rfc/rfc-56/rfc-56.md:
##
@@ -0,0 +1,226 @@
+
+
+# RFC-56: Federated Storage Layer
+
+## Proposers
+- @umehrot2
+
+## Approvers
+- @vinoth
+- @shivnarayan
+
+## Status
+
+JIRA: 
[https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625)
+
+## Abstract
+
+As you scale your Apache Hudi workloads over Cloud object stores like Amazon 
S3, there is the potential of hitting request
+throttling limits, which in turn impacts performance. In this RFC, we are 
proposing to support an alternate storage
+layout that is optimized for Amazon S3 and other cloud object stores, which 
helps achieve maximum throughput and
+significantly reduce throttling.
+
+In addition, we are proposing an interface that would allow users to implement 
their own custom strategy to allow them
+to distribute the data files across cloud stores, hdfs or on prem based on 
their specific use-cases.
+
+## Background
+
+Apache Hudi follows the traditional Hive storage layout while writing files on 
storage:
+- Partitioned Tables: The files are distributed across multiple physical 
partition folders, under the table's base path.
+- Non Partitioned Tables: The files are stored directly under the table's base 
path.
+
+While this storage layout scales well for HDFS, it increases the probability 
of hitting request throttle limits when
+working with cloud object stores like Amazon S3 and others. This is because 
Amazon S3 and other cloud stores [throttle
+requests based on object 
prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/).
+Amazon S3 does scale based on request patterns for different prefixes and adds 
internal partitions (with their own request limits),
+but there can be a 30 - 60 minute wait time before new partitions are created. 
Thus, all files/objects stored under the
+same table path prefix could result in these request limits being hit for the 
table prefix, specially as workloads
+scale, and there are several thousands of files being written/updated 
concurrently. This hurts performance due to
+re-trying of failed requests affecting throughput, and result in occasional 
failures if the retries are not able to
+succeed either and continue to be throttled.
+
+The traditional storage layout also tightly couples the partitions as folders 
under the table path. However,
+some users want flexibility to be able to distribute files/partitions under 
multiple different paths across cloud stores,
+hdfs etc. based on their specific needs. For example, customers have use cases 
to distribute files for each partition under
+a separate S3 bucket with its individual encryption key. It is not possible to 
implement such use-cases with Hudi currently.
+
+The high level proposal here is to introduce a new storage layout strategy, 
where all files are distributed evenly across
+multiple randomly generated prefixes under the Amazon S3 bucket, instead of 
being stored under a common table path/prefix.
+This would help distribute the requests evenly across different prefixes, 
resulting in Amazon S3 to create partitions for
+the prefixes each with its own request limit. This significantly reduces the 
possibility of hitting the request limit
+for a specific prefix/partition.
+
+In addition, we want to expose an interface that provides users the 
flexibility to implement their own strategy for
+distributing files if using the traditional Hive storage layout or federated 
storage layer (proposed in this RFC) does
+not meet their use-case.
+
+## Design
+
+### Interface
+
+```java
+/**
+ * Interface for providing storage file locations.
+ */
+public interface FederatedStorageStrategy extends Serializable {
+  /**
+   * Return a fully-qualified storage file location for the given filename.
+   *
+   * @param fileName data file name
+   * @return a fully-qualified location URI for a data file
+   */
+  String storageLocation(String fileName);
+
+  /**
+   * Return a fully-qualified storage file location for the given partition 
and filename.
+   *
+   * @param partitionPath partition path for the file
+   * @param fileName data file name
+   * @return a fully-qualified location URI for a data file
+   */
+  String storageLocation(String partitionPath, String fileName);
+}
+```
+
+### Generating file paths for Cloud storage optimized layout
+
+We want to distribute files evenly across multiple random prefixes, instead of 
following the traditional Hive storage
+layout of keeping them under a common table path/prefix. In addition to the 
`Table Path`, for this new layout user will
+configure another `Table Storage Path` under which the actual data files will 
be distributed. The original `Table Path` will
+be used to maintain the table/partitions Hudi metadata.
+
+For the purpose of this documentation, let's assume:
+```
+Table Path => s3:
+
+Table 
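
A minimal sketch of how the `FederatedStorageStrategy` interface quoted above 
could be implemented (illustrative only; the class name and prefix list are 
hypothetical): hash the file name into a fixed set of prefixes so requests 
spread evenly across S3's internal partitions.

```java
// Hypothetical strategy: files are spread across a fixed set of random prefixes.
public class HashedPrefixStorageStrategy implements FederatedStorageStrategy {
  private static final String[] PREFIXES = {
      "s3://bucket/0a1b", "s3://bucket/7f3c", "s3://bucket/c29d"};

  @Override
  public String storageLocation(String fileName) {
    // floorMod keeps the slot non-negative even for negative hash codes.
    int slot = Math.floorMod(fileName.hashCode(), PREFIXES.length);
    return PREFIXES[slot] + "/" + fileName;
  }

  @Override
  public String storageLocation(String partitionPath, String fileName) {
    // Fold the partition into the hashed key so the mapping stays deterministic
    // per (partition, file) pair and the full path remains reversible.
    return storageLocation(partitionPath + "/" + fileName);
  }
}
```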

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-21 Thread GitBox


alexeykudinkin commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r977042462


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java:
##
@@ -291,59 +284,51 @@ public void checkState() {
     }
   }
 
-  //////////////////////////////////////////////////////////////////////////////
-
-  //
-  // NOTE: This method duplicates those ones of the HoodieRecordPayload and are placed here
-  //       for the duration of RFC-46 implementation, until migration off `HoodieRecordPayload`
-  //       is complete
-  //
-  public abstract HoodieRecord mergeWith(HoodieRecord other, Schema readerSchema, Schema writerSchema) throws IOException;
+  /**
+   * Get column in record to support RDDCustomColumnsSortPartitioner
+   */
+  public abstract Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean consistentLogicalTimestampEnabled);
 
-  public abstract HoodieRecord rewriteRecord(Schema recordSchema, Schema targetSchema, TypedProperties props) throws IOException;
+  /**
+   * Support bootstrap.
+   */
+  public abstract HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws IOException;

Review Comment:
   Understood. Let's keep it for now, but just rename it to `joinWith` to avoid 
confusion



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-21 Thread GitBox


alexeykudinkin commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r977041457


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkStructTypeSerializer.scala:
##
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import com.esotericsoftware.kryo.Kryo
+import com.esotericsoftware.kryo.io.{Input, Output}
+import com.twitter.chill.KSerializer
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
+import java.nio.ByteBuffer
+import java.nio.charset.StandardCharsets
+import org.apache.avro.SchemaNormalization
+import org.apache.commons.io.IOUtils
+import org.apache.hudi.commmon.model.HoodieSparkRecord
+import org.apache.spark.io.CompressionCodec
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+import org.apache.spark.{SparkEnv, SparkException}
+import scala.collection.mutable
+
+/**
+ * Custom serializer used for generic spark records. If the user registers the schemas
+ * ahead of time, then the schema's fingerprint will be sent with each message instead of the actual
+ * schema, as to reduce network IO.
+ * Actions like parsing or compressing schemas are computationally expensive so the serializer
+ * caches all previously seen values as to reduce the amount of work needed to do.
+ * @param schemas a map where the keys are unique IDs for spark schemas and the values are the
+ *                string representation of the Avro schema, used to decrease the amount of data
+ *                that needs to be serialized.
+ */
+class SparkStructTypeSerializer(schemas: Map[Long, StructType]) extends KSerializer[HoodieSparkRecord] {

Review Comment:
   https://hudi.apache.org/docs/quick-start-guide



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-21 Thread GitBox


alexeykudinkin commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r977040996


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkStructTypeSerializer.scala:
##
@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import com.esotericsoftware.kryo.Kryo
+import com.esotericsoftware.kryo.io.{Input, Output}
+import com.twitter.chill.KSerializer
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
+import java.nio.ByteBuffer
+import java.nio.charset.StandardCharsets
+import org.apache.avro.SchemaNormalization
+import org.apache.commons.io.IOUtils
+import org.apache.hudi.commmon.model.HoodieSparkRecord
+import org.apache.spark.io.CompressionCodec
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+import org.apache.spark.{SparkEnv, SparkException}
+import scala.collection.mutable
+
+/**
+ * Custom serializer used for generic spark records. If the user registers the schemas
+ * ahead of time, then the schema's fingerprint will be sent with each message instead of the actual
+ * schema, as to reduce network IO.
+ * Actions like parsing or compressing schemas are computationally expensive so the serializer
+ * caches all previously seen values as to reduce the amount of work needed to do.
+ * @param schemas a map where the keys are unique IDs for spark schemas and the values are the
+ *                string representation of the Avro schema, used to decrease the amount of data
+ *                that needs to be serialized.
+ */
+class SparkStructTypeSerializer(schemas: Map[Long, StructType]) extends KSerializer[HoodieSparkRecord] {

Review Comment:
   Sorry, my bad, I wasn't clear enough -- we will have to:
   
    - Implement a Registrar to make sure it registers our custom serializer (see the sketch below)
    - Make sure we update the docs to include it (and make sure to highlight it in the change-log), similarly to how we recommend including the `spark.serializer` config
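   
   For illustration only, the registrar could be wired up roughly like this (a sketch, not the final PR code; the registrar name is made up, and the import path mirrors the one in this PR):
   
   ```scala
   import com.esotericsoftware.kryo.Kryo
   import org.apache.hudi.commmon.model.HoodieSparkRecord
   import org.apache.spark.serializer.KryoRegistrator
   import org.apache.spark.sql.hudi.SparkStructTypeSerializer
   
   class HoodieSparkKryoRegistrar extends KryoRegistrator {
     override def registerClasses(kryo: Kryo): Unit = {
       // Register the custom serializer; in practice the schema map would be
       // populated with the schemas registered ahead of time.
       kryo.register(classOf[HoodieSparkRecord], new SparkStructTypeSerializer(Map.empty))
     }
   }
   ```
   
   Users would then enable it alongside `spark.serializer`, e.g. `--conf spark.kryo.registrator=org.apache.spark.sql.hudi.HoodieSparkKryoRegistrar` (class name illustrative).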



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6640: [SUPPORT] HUDI partition table duplicate data cow hudi 0.10.0 flink 1.13.1

2022-09-21 Thread GitBox


yihua commented on issue #6640:
URL: https://github.com/apache/hudi/issues/6640#issuecomment-1254303659

   @yuzhaojing @danny0405 Could any one of you chime in here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6644: Hudi Multi Writer DynamoDBBasedLocking issue

2022-09-21 Thread GitBox


yihua commented on issue #6644:
URL: https://github.com/apache/hudi/issues/6644#issuecomment-1254302236

   @koochiswathiTR Thanks for raising this!  The config naming of `partition_key` is confusing to newcomers.  Here's what you need to do (see the example configuration below):
   (1) As @xushiyan already mentioned, you don't need to set the credentials in env variables if the instance or service is already granted access with the proper roles;
   (2) By default, `hoodie.write.lock.dynamodb.partition_key` is set to the table name, so that multiple writers writing to the same table share the same lock.  If you customize the value, make sure it's the same for all the writers;
   (3) Note that `hoodie.write.lock.dynamodb.partition_key` specifies the value to use for the partition key column, not the column name itself.  The column name is fixed to `key` in the DynamoDB table;
   (4) The DynamoDB table for locking purposes is automatically created by the Hudi code, so you don't have to create the table yourself.  If you do create it, make sure the partition key column is named `key`, not `lock` or the value specified by `hoodie.write.lock.dynamodb.partition_key`.
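   
   For reference, a minimal multi-writer lock configuration might look like the sketch below (the table name, partition-key value, and region are illustrative and should match your setup):
   
   ```
   hoodie.write.concurrency.mode=optimistic_concurrency_control
   hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   hoodie.write.lock.dynamodb.table=hudi-locks
   # Same value for every writer of this table (defaults to the table name):
   hoodie.write.lock.dynamodb.partition_key=my_hudi_table
   hoodie.write.lock.dynamodb.region=us-east-1
   ```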
   
   Let me know if this solves your problem.  Feel free to close it once all 
good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data

2022-09-21 Thread GitBox


hudi-bot commented on PR #4015:
URL: https://github.com/apache/hudi/pull/4015#issuecomment-1254300684

   
   ## CI report:
   
   * 375927ade5b4b327e44ebc227fb57e64de524fcc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3426)
 
   * e1cf530fbae41de33cb9cc76a16a2e6dc5425837 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11560)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data

2022-09-21 Thread GitBox


hudi-bot commented on PR #4015:
URL: https://github.com/apache/hudi/pull/4015#issuecomment-1254296165

   
   ## CI report:
   
   * 375927ade5b4b327e44ebc227fb57e64de524fcc Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3426)
 
   * e1cf530fbae41de33cb9cc76a16a2e6dc5425837 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] bhasudha commented on a diff in pull request #6638: [DOCS] Add tags to blog pages

2022-09-21 Thread GitBox


bhasudha commented on code in PR #6638:
URL: https://github.com/apache/hudi/pull/6638#discussion_r977030742


##
README.md:
##
@@ -156,6 +156,44 @@ Example: When you change any file in 
`versioned_docs/version-0.7.0/`, it will on
 ## Configs
 Configs can be automatically updated by following these steps documented at 
../hudi-utils/README.md
 
+## Blogs
+
+When adding a new blog, please follow these guidelines.
+
+1. Every Blog should have the `title`, `authors`, `image`, `tags` in the 
metadata of the blog. For example the front matter 
+for a blog should look like below. 
+```
+---
+title: "Blog title"
+author: FirstName LastName
+category: blog
+image: /assets/images/blog/
+tags:
+- how-to
+- deltastreamer
+- incremental-processing
+- apache hudi
+---
+```
+2. The blog can be inline or can refer to an external blog. If it's an inline blog, please save it as a `.md` file.
+Example of an inline blog - [Build Open Lakehouse using Apache Hudi & dbt](https://github.com/apache/hudi/blob/asf-site/website/blog/2022-07-11-build-open-lakehouse-using-apache-hudi-and-dbt.md).
+If the blog refers to an external blog, you need to embed the redirect URL and save it as a `.mdx` file.
+Take a look at this blog for reference - [Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison](https://raw.githubusercontent.com/apache/hudi/asf-site/website/blog/2022-08-18-Apache-Hudi-vs-Delta-Lake-vs-Apache-Iceberg-Lakehouse-Feature-Comparison.mdx)
+3. The image must be uploaded in the path /assets/images/blog/ and should be of the standard size 1200 x 600
+4. The tags should be representative of these
+   1. tag1
+  - how-to (tutorial, recipes, show case how to use feature x)

Review Comment:
   Ah yes, I'll add it in a follow-up PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch asf-site updated: [DOCS] Add tags to blog pages (#6638)

2022-09-21 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 12ebe2bdef [DOCS] Add tags to blog pages (#6638)
12ebe2bdef is described below

commit 12ebe2bdef369cf7eb80cb2767e88fbbcb4f10d6
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Wed Sep 21 15:18:50 2022 -0700

[DOCS] Add tags to blog pages (#6638)
---
 README.md  | 38 ++
 ...e-Case-for-incremental-processing-on-Hadoop.mdx |  4 +++
 ...-Incremental-Processing-Framework-on-Hadoop.mdx |  4 +++
 .../blog/2019-05-14-registering-dataset-to-hive.md |  3 ++
 .../blog/2019-09-09-ingesting-database-changes.md  |  3 ++
 website/blog/2019-10-22-Hudi-On-Hops.mdx   |  3 ++
 ...-Data-on-S3-with-Amazon-EMR-and-Apache-Hudi.mdx |  3 ++
 website/blog/2020-01-15-delete-support-in-hudi.md  |  4 +++
 .../blog/2020-01-20-change-capture-using-aws.md|  5 +++
 website/blog/2020-03-22-exporting-hudi-datasets.md |  4 +++
 .../blog/2020-04-27-apache-hudi-apache-zepplin.md  |  4 +++
 ...0-05-28-monitoring-hudi-metrics-with-datadog.md |  4 +++
 ...nnounces-Apache-Hudi-as-a-Top-Level-Project.mdx |  3 ++
 ...ctional-Data-Lake-at-Uber-Using-Apache-Hudi.mdx |  5 +++
 ...-Apache-Hudi-grows-cloud-data-lake-maturity.mdx |  3 ++
 .../blog/2020-08-04-PrestoDB-and-Apache-Hudi.mdx   |  3 ++
 ...18-hudi-incremental-processing-on-data-lakes.md |  5 +++
 ...-efficient-migration-of-large-parquet-tables.md |  5 +++
 ...2020-08-21-async-compaction-deployment-model.md |  4 +++
 ...2020-08-22-ingest-multiple-tables-using-hudi.md |  4 +++
 ...020-10-06-cdc-solution-using-hudi-by-nclouds.md |  4 +++
 .../2020-10-15-apache-hudi-meets-apache-flink.md   |  4 +++
 .../2020-10-19-Origins-of-Data-Lake-at-Grofers.mdx |  6 
 .../2020-10-19-hudi-meets-aws-emr-and-aws-dms.md   |  3 ++
 ...Enterprise-at-Data-Summit-Connect-Fall-2020.mdx |  3 ++
 ...apture-using-Apache-Hudi-and-Amazon-AMS-EMR.mdx |  5 +++
 .../blog/2020-11-11-hudi-indexing-mechanisms.md|  4 +++
 ...-11-29-Can-Big-Data-Solutions-Be-Affordable.mdx |  5 +++
 ...gh-perf-data-lake-with-hudi-and-alluxio-t3go.md |  6 
 website/blog/2021-01-27-hudi-clustering-intro.md   |  4 +++
 website/blog/2021-02-13-hudi-key-generators.md |  4 +++
 ...ravel-operations-in-Hopsworks-Feature-Store.mdx |  6 
 ...-Generation-of-Data-Lakes-using-Apache-Hudi.mdx |  4 +++
 website/blog/2021-03-01-hudi-file-sizing.md|  4 +++
 ...-stream-for-amazon-dynamodb-and-apache-hudi.mdx |  4 +++
 ...New-features-from-Apache-hudi-in-Amazon-EMR.mdx |  3 ++
 ...-Apache-Spark-and-Apache-Hudi-on-Amazon-EMR.mdx |  4 +++
 .../2021-05-12-Experts-primer-on-Apache-Hudi.mdx   |  3 ++
 ...ow-Uber-gets-data-a-ride-to-its-destination.mdx |  3 ++
 ...loying-right-configurations-for-hudi-cleaner.md |  6 +++-
 ...6-Amazon-Athena-expands-Apache-Hudi-support.mdx |  3 ++
 ...e-with-amazon-athena-Read-optimized-queries.mdx |  4 +++
 .../2021-07-21-streaming-data-lake-platform.md |  4 +++
 ...-lake-evolution-scheme-based-on-Apache-Hudi.mdx |  5 +++
 ...ars-Versioned-Feature-Data-with-a-Lakehouse.mdx |  7 
 ...cient-Open-Source-Big-Data-Platform-at-Uber.mdx |  7 
 .../blog/2021-08-16-kafka-custom-deserializer.md   |  6 
 .../blog/2021-08-18-improving-marker-mechanism.md  |  5 +++
 website/blog/2021-08-18-virtual-keys.md|  4 +++
 website/blog/2021-08-23-async-clustering.md|  4 +++
 website/blog/2021-08-23-s3-events-source.md|  4 +++
 ...g-eb-level-data-lake-using-hudi-at-bytedance.md |  3 ++
 .../blog/2021-10-05-Data-Platform-2.0-Part-I.mdx   |  5 +++
 ...abyte-scale-using-AWS-Glue-with-Apache-Hudi.mdx |  5 +++
 ...n-building-real-time-data-lake-at-station-B.mdx |  4 +++
 ...-at-enterprise-scale-using-the-AWS-platform.mdx |  4 +++
 ...-Hudi-Architecture-Tools-and-Best-Practices.mdx |  3 ++
 ...se-concurrency-control-are-we-too-optimistic.md |  4 +++
 ...udi-0.7.0-and-0.8.0-available-on-Amazon-EMR.mdx |  3 ++
 ...hudi-zorder-and-hilbert-space-filling-curves.md |  5 +++
 ...es-with-Apache-Hudi-Kafka-Hive-and-Debezium.mdx |  4 +++
 ...2022-01-06-apache-hudi-2021-a-year-in-review.md |  4 +++
 ...e-data-capture-with-debezium-and-apache-hudi.md |  7 +++-
 ...nd-How-I-Integrated-Airbyte-and-Apache-Hudi.mdx |  4 +++
 ...-lake-efforts-at-Walmart-and-Disney-Hotstar.mdx |  3 ++
 ...st-Efficiency-Scale-in-Big-Data-File-Format.mdx |  6 
 .../2022-02-02-Onehouse-Commitment-to-Openness.mdx |  4 +++
 ...gs-a-fully-managed-lakehouse-to-Apache-Hudi.mdx |  4 +++
 ...-transformations-on-Distributed-file-system.mdx |  3 ++
 ...ating-Current-Interest-and-Rate-of-Adoption.mdx |  6 
 .../2022-02-17-Fresher-Data-Lake-on-AWS-S3.mdx |  4 +++
 ...s-core-concepts-from-hudi-persistence-files.mdx |  4 +++
 

[GitHub] [hudi] nsivabalan commented on a diff in pull request #6638: [DOCS] Add tags to blog pages

2022-09-21 Thread GitBox


nsivabalan commented on code in PR #6638:
URL: https://github.com/apache/hudi/pull/6638#discussion_r976869703


##
README.md:
##
@@ -156,6 +156,44 @@ Example: When you change any file in 
`versioned_docs/version-0.7.0/`, it will on
 ## Configs
 Configs can be automatically updated by following these steps documented at 
../hudi-utils/README.md
 
+## Blogs
+
+When adding a new blog, please follow these guidelines.
+
+1. Every Blog should have the `title`, `authors`, `image`, `tags` in the 
metadata of the blog. For example the front matter 
+for a blog should look like below. 
+```
+---
+title: "Blog title"
+author: FirstName LastName
+category: blog
+image: /assets/images/blog/
+tags:
+- how-to
+- deltastreamer
+- incremental-processing
+- apache hudi
+---
+```
+2. The blog can be inline or can refer to an external blog. If it's an inline blog, please save it as a `.md` file.
+Example of an inline blog - [Build Open Lakehouse using Apache Hudi & dbt](https://github.com/apache/hudi/blob/asf-site/website/blog/2022-07-11-build-open-lakehouse-using-apache-hudi-and-dbt.md).
+If the blog refers to an external blog, you need to embed the redirect URL and save it as a `.mdx` file.
+Take a look at this blog for reference - [Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison](https://raw.githubusercontent.com/apache/hudi/asf-site/website/blog/2022-08-18-Apache-Hudi-vs-Delta-Lake-vs-Apache-Iceberg-Lakehouse-Feature-Comparison.mdx)
+3. The image must be uploaded in the path /assets/images/blog/ and should be of the standard size 1200 x 600
+4. The tags should be representative of these
+   1. tag1
+  - how-to (tutorial, recipes, show case how to use feature x)

Review Comment:
   Guess we might need to add `blog` as one of the values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan merged pull request #6638: [DOCS] Add tags to blog pages

2022-09-21 Thread GitBox


nsivabalan merged PR #6638:
URL: https://github.com/apache/hudi/pull/6638


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6398: [SUPPORT] Metadata table thows hbase exceptions

2022-09-21 Thread GitBox


yihua commented on issue #6398:
URL: https://github.com/apache/hudi/issues/6398#issuecomment-1254287346

   > @yihua yes this parameter is placed in separate hbase-site.xml which is 
used by spark.
   
   Thanks for the confirmation!  I'll also list this as a workaround in our FAQ.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data

2022-09-21 Thread GitBox


nsivabalan commented on PR #4015:
URL: https://github.com/apache/hudi/pull/4015#issuecomment-1254285358

   I have pushed out a commit to address the feedback; yet to see if we can cover the fix with a test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua closed issue #6658: [SUPPORT] undrop table

2022-09-21 Thread GitBox


yihua closed issue #6658: [SUPPORT] undrop table
URL: https://github.com/apache/hudi/issues/6658


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6658: [SUPPORT] undrop table

2022-09-21 Thread GitBox


yihua commented on issue #6658:
URL: https://github.com/apache/hudi/issues/6658#issuecomment-1254283326

   @melin Thank you for raising this feature request!  I created a Jira ticket 
to track the work and let's follow up there: HUDI-4891.  Closing this support 
ticket.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark

2022-09-21 Thread GitBox


alexeykudinkin commented on code in PR #6516:
URL: https://github.com/apache/hudi/pull/6516#discussion_r977025908


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java:
##
@@ -665,13 +671,21 @@ public final Stream<FileSlice> getLatestFileSlicesBeforeOrOn(String partitionStr
       readLock.lock();
       String partitionPath = formatPartitionKey(partitionStr);
       ensurePartitionLoadedCorrectly(partitionPath);
-      Stream<FileSlice> fileSliceStream = fetchLatestFileSlicesBeforeOrOn(partitionPath, maxCommitTime)
-          .filter(slice -> !isFileGroupReplacedBeforeOrOn(slice.getFileGroupId(), maxCommitTime));
+      Stream<Stream<FileSlice>> allFileSliceStream = fetchAllStoredFileGroups(partitionPath)

Review Comment:
   `Stream<Stream<FileSlice>>` doesn't make sense, let's flat-map it
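   
   Something along these lines (a sketch, assuming `HoodieFileGroup#getAllFileSlices` as the flattening step):
   
   ```java
   Stream<FileSlice> allFileSlices = fetchAllStoredFileGroups(partitionPath)
       .flatMap(HoodieFileGroup::getAllFileSlices)
       .filter(slice -> !isFileGroupReplacedBeforeOrOn(slice.getFileGroupId(), maxCommitTime));
   ```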



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4891) Support UNDROP TABLE in Spark SQL

2022-09-21 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4891:

Description: 
Specifies the identifier for the table to restore. If the identifier contains 
spaces or special characters, the entire string must be enclosed in double 
quotes. Identifiers enclosed in double quotes are also case-sensitive.
 # Restoring tables is only supported in the current schema or current 
database, even if the table name is fully-qualified.
 # If a table with the same name already exists, an error is returned.
 # UNDROP relies on the Snowflake Time Travel feature. An object can be restored only if the object was deleted within the data retention period. The default value is 24 hours.

[https://docs.snowflake.com/en/sql-reference/sql/undrop-table.html]
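
For illustration, the Spark SQL usage in Hudi could mirror Snowflake's syntax (hypothetical; the command does not exist in Hudi yet):

UNDROP TABLE my_database.my_hudi_table;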

> Support UNDROP TABLE in Spark SQL
> -
>
> Key: HUDI-4891
> URL: https://issues.apache.org/jira/browse/HUDI-4891
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
>
> Specifies the identifier for the table to restore. If the identifier contains 
> spaces or special characters, the entire string must be enclosed in double 
> quotes. Identifiers enclosed in double quotes are also case-sensitive.
>  # Restoring tables is only supported in the current schema or current 
> database, even if the table name is fully-qualified.
>  # If a table with the same name already exists, an error is returned.
>  # UNDROP relies on the Snowflake Time Travel feature. An object can be restored only if the object was deleted within the data retention period. The default value is 24 hours.
> [https://docs.snowflake.com/en/sql-reference/sql/undrop-table.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4891) Support UNDROP TABLE in Spark SQL

2022-09-21 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-4891:
---

 Summary: Support UNDROP TABLE in Spark SQL
 Key: HUDI-4891
 URL: https://issues.apache.org/jira/browse/HUDI-4891
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Ethan Guo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4891) Support UNDROP TABLE in Spark SQL

2022-09-21 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4891:

Fix Version/s: 1.0.0

> Support UNDROP TABLE in Spark SQL
> -
>
> Key: HUDI-4891
> URL: https://issues.apache.org/jira/browse/HUDI-4891
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Priority: Major
> Fix For: 1.0.0
>
>
> Specifies the identifier for the table to restore. If the identifier contains 
> spaces or special characters, the entire string must be enclosed in double 
> quotes. Identifiers enclosed in double quotes are also case-sensitive.
>  # Restoring tables is only supported in the current schema or current 
> database, even if the table name is fully-qualified.
>  # If a table with the same name already exists, an error is returned.
>  # UNDROP relies on the Snowflake Time Travel feature. An object can be restored only if the object was deleted within the data retention period. The default value is 24 hours.
> [https://docs.snowflake.com/en/sql-reference/sql/undrop-table.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-21 Thread GitBox


alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r977019700


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java:
##
@@ -275,6 +345,66 @@ private HoodieData<HoodieRecord<T>> readRecordsForGroupBaseFiles(JavaSparkContext
         .map(record -> transform(record, writeConfig)));
   }
 
+  /**
+   * Get dataset of all records for the group. This includes all records from file slice (Apply updates from log files, if any).
+   */
+  private Dataset<Row> readRecordsForGroupAsRow(JavaSparkContext jsc,
+                                                HoodieClusteringGroup clusteringGroup,
+                                                String instantTime) {
+    List<ClusteringOperation> clusteringOps = clusteringGroup.getSlices().stream()
+        .map(ClusteringOperation::create).collect(Collectors.toList());
+    boolean hasLogFiles = clusteringOps.stream().anyMatch(op -> op.getDeltaFilePaths().size() > 0);
+    SQLContext sqlContext = new SQLContext(jsc.sc());
+
+    Path[] baseFilePaths = clusteringOps
+        .stream()
+        .map(op -> {
+          ArrayList<String> readPaths = new ArrayList<>();
+          if (op.getBootstrapFilePath() != null) {
+            readPaths.add(op.getBootstrapFilePath());
+          }
+          if (op.getDataFilePath() != null) {
+            readPaths.add(op.getDataFilePath());
+          }
+          return readPaths;
+        })
+        .flatMap(Collection::stream)
+        .filter(path -> !path.isEmpty())
+        .map(Path::new)
+        .toArray(Path[]::new);
+
+    HashMap<String, String> params = new HashMap<>();
+    params.put("hoodie.datasource.query.type", "snapshot");
+    params.put("as.of.instant", instantTime);
+
+    Path[] paths;
+    if (hasLogFiles) {
+      String compactionFractor = Option.ofNullable(getWriteConfig().getString("compaction.memory.fraction"))
+          .orElse("0.75");
+      params.put("compaction.memory.fraction", compactionFractor);
+
+      Path[] deltaPaths = clusteringOps
+          .stream()
+          .filter(op -> !op.getDeltaFilePaths().isEmpty())
+          .flatMap(op -> op.getDeltaFilePaths().stream())
+          .map(Path::new)
+          .toArray(Path[]::new);
+      paths = CollectionUtils.combine(baseFilePaths, deltaPaths);
+    } else {
+      paths = baseFilePaths;
+    }
+
+    String readPathString = String.join(",", Arrays.stream(paths).map(Path::toString).toArray(String[]::new));
+    params.put("hoodie.datasource.read.paths", readPathString);
+    // Building HoodieFileIndex needs this param to decide query path
+    params.put("glob.paths", readPathString);
+
+    // Let Hudi relations to fetch the schema from the table itself
+    BaseRelation relation = SparkAdapterSupport$.MODULE$.sparkAdapter()

Review Comment:
   :+1:



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-21 Thread GitBox


alexeykudinkin commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254275167

   @boneanxs thank you very much for iterating on this one! Truly monumental 
effort!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-21 Thread GitBox


alexeykudinkin commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1254275527

   Did you try to re-run your benchmark after the changes we've made? If so, can you please paste the results here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6686: Apache Hudi Consistency issues with glue and marketplace connector

2022-09-21 Thread GitBox


yihua commented on issue #6686:
URL: https://github.com/apache/hudi/issues/6686#issuecomment-1254272868

   @asankadarshana007  The consistency check, when enabled, happens when removing invalid data files: (1) check that all paths to delete exist, (2) delete them, (3) wait for all paths to disappear, to accommodate eventual consistency.  Note that this logic is not needed on strongly consistent storage.  Since the invalid data files are now determined based on the markers, there can be a case where a marker is created but the data file has not yet been written, so check (1) fails, which is okay.  Given that there is currently no use case for eventual consistency, we don't maintain this logic.
   
   Let me know if turning off `hoodie.consistency.check.enabled` solves your 
problem.  You can close the ticket if all good.
   
   ```
   if (!invalidDataPaths.isEmpty()) {
     LOG.info("Removing duplicate data files created due to task retries before committing. Paths=" + invalidDataPaths);
     Map<String, List<Pair<String, String>>> invalidPathsByPartition = invalidDataPaths.stream()
         .map(dp -> Pair.of(new Path(basePath, dp).getParent().toString(), new Path(basePath, dp).toString()))
         .collect(Collectors.groupingBy(Pair::getKey));

     // Ensure all files in delete list is actually present. This is mandatory for an eventually consistent FS.
     // Otherwise, we may miss deleting such files. If files are not found even after retries, fail the commit
     if (consistencyCheckEnabled) {
       // This will either ensure all files to be deleted are present.
       waitForAllFiles(context, invalidPathsByPartition, FileVisibility.APPEAR);
     }

     // Now delete partially written files
     context.setJobStatus(this.getClass().getSimpleName(), "Delete all partially written files: " + config.getTableName());
     deleteInvalidFilesByPartitions(context, invalidPathsByPartition);

     // Now ensure the deleted files disappear
     if (consistencyCheckEnabled) {
       // This will either ensure all files to be deleted are absent.
       waitForAllFiles(context, invalidPathsByPartition, FileVisibility.DISAPPEAR);
     }
   }
   ```
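   
   For example, the check can be disabled with the following write option (assuming your storage, like current Amazon S3, is strongly consistent):
   
   ```
   hoodie.consistency.check.enabled=false
   ```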


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-3796) Implement layout to filter out uncommitted log files without reading the log blocks

2022-09-21 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607985#comment-17607985
 ] 

sivabalan narayanan commented on HUDI-3796:
---

Changing the name of the log file is a pretty big change; don't think we can get it into 0.12.1. Punting this for now.

> Implement layout to filter out uncommitted log files without reading the log 
> blocks
> ---
>
> Key: HUDI-3796
> URL: https://issues.apache.org/jira/browse/HUDI-3796
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> Related: HUDI-3637
> At high level, getLatestFileSlices() is going to fetch the latest file slices 
> for committed base files and filter out any file slices with the uncommitted 
> base instant time.  The uncommitted log files in the latest file slices may 
> be included, and they are skipped while doing log reading and merging, i.e., 
> the logic in "AbstractHoodieLogRecordReader".
> We can use log instant time instead of base instant time for the log file 
> name so that it is able to filter out uncommitted log files without reading 
> the log blocks beforehand.
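
For illustration (file ID and instant times made up): a log file name currently embeds the base instant time, e.g. ".fileId1-0_20220921101530123.log.1_1-0-1"; the proposal is to embed the log (delta commit) instant time instead, e.g. ".fileId1-0_20220921113045456.log.1_1-0-1", so an uncommitted log file can be filtered by checking its instant against the timeline, without reading its blocks.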



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3796) Implement layout to filter out uncommitted log files without reading the log blocks

2022-09-21 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3796:
--
Sprint:   (was: 2022/09/19)

> Implement layout to filter out uncommitted log files without reading the log 
> blocks
> ---
>
> Key: HUDI-3796
> URL: https://issues.apache.org/jira/browse/HUDI-3796
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.1
>
>
> Related: HUDI-3637
> At high level, getLatestFileSlices() is going to fetch the latest file slices 
> for committed base files and filter out any file slices with the uncommitted 
> base instant time.  The uncommitted log files in the latest file slices may 
> be included, and they are skipped while doing log reading and merging, i.e., 
> the logic in "AbstractHoodieLogRecordReader".
> We can use log instant time instead of base instant time for the log file 
> name so that it is able to filter out uncommitted log files without reading 
> the log blocks beforehand.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

