[GitHub] [hudi] yanghua commented on issue #4558: [BUG] Bootstrap operation data loading missing

2022-02-14 Thread GitBox


yanghua commented on issue #4558:
URL: https://github.com/apache/hudi/issues/4558#issuecomment-1039934131


   @waywtdcc Does Hudi 0.10 also have this issue? Did you try it? IMHO, there are 
some issues in Hudi 0.9 with the Flink integration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4212: [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present.

2022-02-14 Thread GitBox


hudi-bot commented on pull request #4212:
URL: https://github.com/apache/hudi/pull/4212#issuecomment-1039921232


   
   ## CI report:
   
   * cb266359f3a11e4fcff70e4df55427381c5843c1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6020)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4212: [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present.

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #4212:
URL: https://github.com/apache/hudi/pull/4212#issuecomment-1039868384


   
   ## CI report:
   
   * 96299439e77b68cf82478a07f51bef7c0b003b4c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5881)
   * cb266359f3a11e4fcff70e4df55427381c5843c1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6020)
   
   




[GitHub] [hudi] hudi-bot commented on pull request #3808: [HUDI-2560] introduce id_based schema to support full schema evolution.

2022-02-14 Thread GitBox


hudi-bot commented on pull request #3808:
URL: https://github.com/apache/hudi/pull/3808#issuecomment-1039903448


   
   ## CI report:
   
   * ad1630ead2e2530fe92c54883bc55f7852d9af10 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6018)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #3808: [HUDI-2560] introduce id_based schema to support full schema evolution.

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #3808:
URL: https://github.com/apache/hudi/pull/3808#issuecomment-1039831567


   
   ## CI report:
   
   * c585862a92bd8c1c184cfa2dd90fc7fc23830886 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6016)
   * ad1630ead2e2530fe92c54883bc55f7852d9af10 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6018)
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4792: [HUDI-3366] Remove hardcoded logic of disabling metadata table in tests

2022-02-14 Thread GitBox


hudi-bot commented on pull request #4792:
URL: https://github.com/apache/hudi/pull/4792#issuecomment-1039892648


   
   ## CI report:
   
   * 92d8377315d30921e2fecdf0472367ba15818a5c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6017)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4792: [HUDI-3366] Remove hardcoded logic of disabling metadata table in tests

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #4792:
URL: https://github.com/apache/hudi/pull/4792#issuecomment-1039801948


   
   ## CI report:
   
   * d8967834c4778bbbdaad3aef70de4efa7c4a5d6d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6007)
   * 92d8377315d30921e2fecdf0472367ba15818a5c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6017)
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4777: [HUDI-2931] Add config to disable table services

2022-02-14 Thread GitBox


hudi-bot commented on pull request #4777:
URL: https://github.com/apache/hudi/pull/4777#issuecomment-1039885425


   
   ## CI report:
   
   * 9c350835427bfdc586f0be3834ead1db6691cba3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6015)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4777: [HUDI-2931] Add config to disable table services

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #4777:
URL: https://github.com/apache/hudi/pull/4777#issuecomment-1039784186


   
   ## CI report:
   
   * 9c350835427bfdc586f0be3834ead1db6691cba3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6015)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4212: [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present.

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #4212:
URL: https://github.com/apache/hudi/pull/4212#issuecomment-1039866841


   
   ## CI report:
   
   * 96299439e77b68cf82478a07f51bef7c0b003b4c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5881)
   * cb266359f3a11e4fcff70e4df55427381c5843c1 UNKNOWN
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4212: [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present.

2022-02-14 Thread GitBox


hudi-bot commented on pull request #4212:
URL: https://github.com/apache/hudi/pull/4212#issuecomment-1039868384


   
   ## CI report:
   
   * 96299439e77b68cf82478a07f51bef7c0b003b4c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5881)
   * cb266359f3a11e4fcff70e4df55427381c5843c1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6020)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4212: [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present.

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #4212:
URL: https://github.com/apache/hudi/pull/4212#issuecomment-1035259608


   
   ## CI report:
   
   * 96299439e77b68cf82478a07f51bef7c0b003b4c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5881)
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4212: [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present.

2022-02-14 Thread GitBox


hudi-bot commented on pull request #4212:
URL: https://github.com/apache/hudi/pull/4212#issuecomment-1039866841


   
   ## CI report:
   
   * 96299439e77b68cf82478a07f51bef7c0b003b4c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5881)
   * cb266359f3a11e4fcff70e4df55427381c5843c1 UNKNOWN
   
   




[GitHub] [hudi] nsivabalan commented on a change in pull request #4212: [HUDI-2925] Fix duplicate cleaning of same files when unfinished clean operations are present.

2022-02-14 Thread GitBox


nsivabalan commented on a change in pull request #4212:
URL: https://github.com/apache/hudi/pull/4212#discussion_r803776863



##
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java
##
@@ -1737,6 +1739,75 @@ public void testErrorCases() throws Exception {
 }
   }
 
+  /**
+   * Tests no more than 1 clean is scheduled/executed if HoodieCompactionConfig.allowMultipleCleanSchedule config is disabled.
+   */
+  @Test
+  public void testMultiClean() throws Exception {

Review comment:
   Yeah, I just followed up from where you left off :) Will try to move it to TestCleaner. 

##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
##
@@ -721,21 +721,29 @@ public HoodieCleanMetadata clean(String cleanInstantTime, boolean skipLocking) t
* @param skipLocking if this is triggered by another parent transaction, locking can be skipped.
*/
   public HoodieCleanMetadata clean(String cleanInstantTime, boolean scheduleInline, boolean skipLocking) throws HoodieIOException {
-if (scheduleInline) {
-  scheduleTableServiceInternal(cleanInstantTime, Option.empty(), TableServiceType.CLEAN);
-}
 LOG.info("Cleaner started");
 final Timer.Context timerContext = metrics.getCleanCtx();
 LOG.info("Cleaned failed attempts if any");

Review comment:
   I have moved it to CleanerUtils.rollbackFailedWrites().
   rollbackFailedWrites() is called in regular rollbacks as well (single writer), and not only in the context of the cleaner.

##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
##
@@ -254,6 +254,12 @@
   .withDocumentation("The average record size. If not explicitly specified, hudi will compute the "
   + "record size estimate compute dynamically based on commit metadata. "
   + " This is critical in computing the insert parallelism and bin-packing inserts into small files.");
+  
+  public static final ConfigProperty ALLOW_MULTIPLE_CLEANS = ConfigProperty
+  .key("hoodie.allow.multiple.cleans")

Review comment:
   In general, it's a good practice to name the variable as you read it. For eg, the getter method of this variable reads as allowMultipleCleans(), which goes in line w/ the config key. 
   
   If you check a few other existing configs, it does align well, but I want to fix that going forward. For eg: hoodie.clean.automatic, where the variable is called AUTO_CLEAN and the method is called isAutoClean. I would have preferred the config key to be "hoodie.auto.clean". 
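   The naming convention described above (config key, constant, and getter all reading the same way) can be sketched in plain Java. `SimpleConfig` below is a hypothetical stand-in for Hudi's `ConfigProperty` machinery, not its actual API:

```java
import java.util.Map;

// Hypothetical config holder; illustrates the naming convention only.
public class SimpleConfig {
    // Key, constant, and getter all read as "allow multiple cleans".
    public static final String ALLOW_MULTIPLE_CLEANS = "hoodie.allow.multiple.cleans";

    private final Map<String, String> props;

    public SimpleConfig(Map<String, String> props) {
        this.props = props;
    }

    // Getter name mirrors the key, so call sites read naturally:
    //   if (config.allowMultipleCleans()) { ... }
    public boolean allowMultipleCleans() {
        return Boolean.parseBoolean(props.getOrDefault(ALLOW_MULTIPLE_CLEANS, "false"));
    }

    public static void main(String[] args) {
        SimpleConfig config = new SimpleConfig(Map.of(ALLOW_MULTIPLE_CLEANS, "true"));
        System.out.println(config.allowMultipleCleans()); // prints true
    }
}
```

   With the older style the comment flags (key `hoodie.clean.automatic`, getter `isAutoClean()`), the key and the call site read differently, which is exactly the mismatch being discussed.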
   








[GitHub] [hudi] hudi-bot commented on pull request #4818: [WIP][HUDI-3396][Stacked on 4789] Make sure `BaseFileOnlyViewRelation` only reads projected columns

2022-02-14 Thread GitBox


hudi-bot commented on pull request #4818:
URL: https://github.com/apache/hudi/pull/4818#issuecomment-1039861398


   
   ## CI report:
   
   * f5b4e6c758066cfbbe117827f4e8f4d11162eade Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6014)
   
   




[GitHub] [hudi] hudi-bot removed a comment on pull request #4818: [WIP][HUDI-3396][Stacked on 4789] Make sure `BaseFileOnlyViewRelation` only reads projected columns

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #4818:
URL: https://github.com/apache/hudi/pull/4818#issuecomment-1039782601


   
   ## CI report:
   
   * f5b4e6c758066cfbbe117827f4e8f4d11162eade Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6014)
   
   




[jira] [Updated] (HUDI-3218) Upgrade Avro to 1.10.2

2022-02-14 Thread Vinoth Chandar (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-3218:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-10)

> Upgrade Avro to 1.10.2
> --
>
> Key: HUDI-3218
> URL: https://issues.apache.org/jira/browse/HUDI-3218
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: storage-management, writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Currently, we pull in Avro 1.8.2, which transitively depends on [Paranamer 
> 2.7|https://mvnrepository.com/artifact/com.thoughtworks.paranamer/paranamer/2.7], 
> which fails when used in conjunction w/ Spark 3.2 on JDK8 w/ the following 
> exceptions:
>  
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 34826
>   at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
>   at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
>   at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
>   at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
>   at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:45)
>   at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:59)
>   at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:59)
>   at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
>   at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
>   at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findConstructorParam$1(BeanIntrospector.scala:59)
>   at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$19(BeanIntrospector.scala:181)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14(BeanIntrospector.scala:175)
>   at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$14$adapted(BeanIntrospector.scala:174)
>   at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
>   at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
>   at scala.collection.immutable.List.flatMap(List.scala:355)
>   at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.apply(BeanIntrospector.scala:174)
>   at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$._descriptorFor(ScalaAnnotationIntrospectorModule.scala:21)
>   at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.fieldName(ScalaAnnotationIntrospectorModule.scala:29)
>   at 
> com.fasterxml.jackson.module.scala.introspect.ScalaAnnotationIntrospector$.findImplicitPropertyName(ScalaAnnotationIntrospectorModule.scala:77)
>   at 
> com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findImplicitPropertyName(AnnotationIntrospectorPair.java:490)
>   at 
> com.fasterxml.jackson.databind.introspect.POJOPropertiesCollector._addFields(POJOPropertiesCollector.java:380)
>   at 
> com.fasterxml.jackson.databind

[jira] [Updated] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull

2022-02-14 Thread Vinoth Chandar (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-1623:
-
Sprint: Hudi-Sprint-Feb-14

> Support start_commit_time & end_commit_times for serializable incremental pull
> --
>
> Key: HUDI-1623
> URL: https://issues.apache.org/jira/browse/HUDI-1623
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2955) Upgrade Hadoop to 3.3.x

2022-02-14 Thread Vinoth Chandar (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-2955:
-
Sprint: Hudi-Sprint-Feb-14

> Upgrade Hadoop to 3.3.x
> ---
>
> Key: HUDI-2955
> URL: https://issues.apache.org/jira/browse/HUDI-2955
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-07 at 2.32.51 PM.png
>
>
> According to Hadoop compatibility matrix, this is a pre-requisite to 
> upgrading to JDK11:
> !Screen Shot 2021-12-07 at 2.32.51 PM.png|width=938,height=230!
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions]
>  
> *Upgrading Hadoop from 2.x to 3.x*
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts]
> Everything (relevant to us) seems to be in a good shape, except Spark 2.2/.3





[jira] [Closed] (HUDI-109) RFC-7 Ability to query older snapshots of data

2022-02-14 Thread Vinoth Chandar (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar closed HUDI-109.
---
Resolution: Fixed

> RFC-7 Ability to query older snapshots of data 
> ---
>
> Key: HUDI-109
> URL: https://issues.apache.org/jira/browse/HUDI-109
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.11.0
>
>
> At the moment, Hudi allows a client to configure the number of versions to 
> keep for a particular dataset. Depending on the trade-off between storage 
> cost vs number of versions required, one can use Hudi to keep multiple views 
> of the dataset at different points in time. 
> Hudi uses this information to provide incremental consumption of what changed 
> between 2 time units but does not have a simple way to be able to query a 
> dataset as of a particular instant in time. There are multiple use-cases 
> which benefit from such a feature, such as ML training/use-cases against the 
> state of a feature store at a particular instant in time or for auditing 
> systems that require seeing previous versions of data. 
> Hudi has most of the building blocks to be able to support such queries, we'd 
> like to explore the feasibility & changes required to support it.
> [~semanticbeeng] something along the lines of what you brought up as well.





[jira] [Created] (HUDI-3433) Migrate _hoodie_partition_metadata file to parquet to ensure bigquery is happy

2022-02-14 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-3433:


 Summary: Migrate _hoodie_partition_metadata file to parquet to 
ensure bigquery is happy
 Key: HUDI-3433
 URL: https://issues.apache.org/jira/browse/HUDI-3433
 Project: Apache Hudi
  Issue Type: Improvement
  Components: writer-core
Reporter: Vinoth Chandar
Assignee: Prashant Wason
 Fix For: 0.11.0








[GitHub] [hudi] hudi-bot removed a comment on pull request #4817: [HUDI-3429] Support clustering scheduleAndExecute for hudi-cli and add clustering-cli Tests

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #4817:
URL: https://github.com/apache/hudi/pull/4817#issuecomment-1039782583


   
   ## CI report:
   
   * 924c0fd27a015e8b7a0c358f2f978e1c95da01cb Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6013)
   
   




[GitHub] [hudi] hudi-bot commented on pull request #4817: [HUDI-3429] Support clustering scheduleAndExecute for hudi-cli and add clustering-cli Tests

2022-02-14 Thread GitBox


hudi-bot commented on pull request #4817:
URL: https://github.com/apache/hudi/pull/4817#issuecomment-1039858760


   
   ## CI report:
   
   * 924c0fd27a015e8b7a0c358f2f978e1c95da01cb Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6013)
   
   




[jira] [Updated] (HUDI-3396) Make sure Spark reads only Projected Columns for both MOR/COW

2022-02-14 Thread Alexey Kudinkin (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3396:
--
Epic Link: HUDI-1297

> Make sure Spark reads only Projected Columns for both MOR/COW
> -
>
> Key: HUDI-3396
> URL: https://issues.apache.org/jira/browse/HUDI-3396
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available, spark
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2022-02-08 at 4.58.12 PM.png
>
>
> Spark Relation impl for MOR table seems to have the following issues:
>  * `requiredSchemaParquetReader` still leverages full table schema, entailing 
> that we're fetching *all* columns from Parquet (even though the query might 
> just be projecting a handful) 
>  * `fullSchemaParquetReader` is always reading full-table to (presumably) be 
> able to do merging which might access arbitrary key-fields. This seems 
> superfluous, since we can only fetch the fields designated as 
> `PRECOMBINE_FIELD_NAME` as well as `RECORDKEY_FIELD_NAME`. We won't be able 
> to do that if either of the following is true:
>  ** Virtual Keys are used (key-gen will require whole payload)
>  ** Non-trivial merging strategy is used requiring whole record payload
>  * We don't seem to properly push-down data filters to Parquet reader when 
> reading whole table
>  
> AIs
>  * Make sure COW tables _only_ read projected columns
>  * Make sure MOR tables _only_ read projected columns, except when either of
>  ** Non-standard Record Payload class is used (for merging) 
>  ** Virtual keys are used
>  * +Write tests for Spark DataSource asserting that only projected columns 
> are being fetched+
>  
> !Screen Shot 2022-02-08 at 4.58.12 PM.png!
>  





[jira] [Updated] (HUDI-2732) Spark Datasource V2 integration RFC

2022-02-14 Thread Vinoth Chandar (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-2732:
-
Reviewers: Vinoth Chandar  (was: Raymond Xu, Vinoth Chandar)

> Spark Datasource V2 integration RFC 
> 
>
> Key: HUDI-2732
> URL: https://issues.apache.org/jira/browse/HUDI-2732
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Created] (HUDI-3432) Test restore scenarios with MT table

2022-02-14 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-3432:


 Summary: Test restore scenarios with MT table
 Key: HUDI-3432
 URL: https://issues.apache.org/jira/browse/HUDI-3432
 Project: Apache Hudi
  Issue Type: Task
Reporter: Vinoth Chandar








[jira] [Assigned] (HUDI-3432) Test restore scenarios with MT table

2022-02-14 Thread Vinoth Chandar (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar reassigned HUDI-3432:


Assignee: sivabalan narayanan

> Test restore scenarios with MT table
> 
>
> Key: HUDI-3432
> URL: https://issues.apache.org/jira/browse/HUDI-3432
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3432) Test restore scenarios with MT table

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3432:
-
Fix Version/s: 0.11.0

> Test restore scenarios with MT table
> 
>
> Key: HUDI-3432
> URL: https://issues.apache.org/jira/browse/HUDI-3432
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3300) Timeline server FSViewManager should avoid point lookup for metadata file partition

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3300:
-
Sprint: Hudi-Sprint-Feb-14

> Timeline server FSViewManager should avoid point lookup for metadata file 
> partition
> ---
>
> Key: HUDI-3300
> URL: https://issues.apache.org/jira/browse/HUDI-3300
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> When inline reading is enabled, that is 
> hoodie.metadata.enable.full.scan.log.files = false, 
> MetadataMergedLogRecordReader doesn't cache the file listings records via the 
> ExternalSpillableMap. So, every file listing will lead to re-reading of 
> metadata files partition log and base files. Since the files partition is 
> small in size, even when inline reading is enabled, the TimelineServer should 
> construct the FSViewManager with inline reading disabled for the metadata 
> files partition. 



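The workaround described above can be sketched in plain Java. This is a minimal sketch; the helper name and partition names are illustrative assumptions, not actual Hudi APIs.

```java
public class MetadataScanPolicy {
    // Hypothetical helper: even when full scan of metadata log files is
    // globally disabled (hoodie.metadata.enable.full.scan.log.files = false,
    // i.e. inline reading), force a full scan for the small "files"
    // partition so its records are cached instead of being re-read from
    // log and base files on every listing.
    static boolean useFullScan(String metadataPartition, boolean globalFullScan) {
        return "files".equals(metadataPartition) || globalFullScan;
    }

    public static void main(String[] args) {
        System.out.println(useFullScan("files", false));         // true
        System.out.println(useFullScan("column_stats", false));  // false
    }
}
```

Larger partitions keep whatever scan mode was configured; only the files partition is special-cased.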


[jira] [Updated] (HUDI-3307) Implement a repair tool to bring the table back into a clean state

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3307:
-
Sprint: Hudi-Sprint-Feb-14

> Implement a repair tool to bring the table back into a clean state
> ---
>
> Key: HUDI-3307
> URL: https://issues.apache.org/jira/browse/HUDI-3307
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Implement a repair tool to bring the table back into a clean state, in case 
> the metadata table corrupts the data table





[GitHub] [hudi] zhangyue19921010 commented on pull request #4810: [HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile()

2022-02-14 Thread GitBox


zhangyue19921010 commented on pull request #4810:
URL: https://github.com/apache/hudi/pull/4810#issuecomment-1039856405


   Hi @codope and @satishkotha, sorry to bother you. Would you mind taking a 
look at this patch?
   Thanks a lot :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3376) Understand the implication of mismatch between metadata table vs FS during clean

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3376:
-
Sprint: Hudi-Sprint-Feb-14

> Understand the implication of mismatch between metadata table vs FS during 
> clean
> 
>
> Key: HUDI-3376
> URL: https://issues.apache.org/jira/browse/HUDI-3376
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3301) MergedLogRecordReader inline reading should be stateless and thread safe

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3301:
-
Sprint: Hudi-Sprint-Feb-14

> MergedLogRecordReader inline reading should be stateless and thread safe
> 
>
> Key: HUDI-3301
> URL: https://issues.apache.org/jira/browse/HUDI-3301
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: HUDI-bug
> Fix For: 0.11.0
>
>
> Metadata table inline reading (enable.full.scan.log.files = false) today 
> alters instance member fields and is not thread safe.
>  
> When the inline reading is enabled, HoodieMetadataMergedLogRecordReader 
> doesn't do full read of log and base files and doesn't fill in the 
> ExternalSpillableMap records cache. Each getRecordsByKeys() thereby will 
> re-read the log and base files by design. But the issue here is that this 
> reading alters the instance members, and the filled-in records are relevant 
> only for that request. Any concurrent getRecordsByKeys() call also modifies 
> the member variables, leading to NPEs.
>  
> To avoid this, a temporary fix of making getRecordsByKeys() a synchronized 
> method has been pushed to master. But this fix doesn't solve all use cases. 
> We need to make the whole class stateless and thread safe for inline reading.



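The stateless pattern the issue asks for can be sketched in simplified, self-contained Java. The class and method names mirror the discussion but are not the actual Hudi implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class StatelessLogRecordReader {
    // Immutable backing data; safe to share across threads.
    private final Map<String, String> logRecords;

    StatelessLogRecordReader(Map<String, String> logRecords) {
        this.logRecords = Collections.unmodifiableMap(new HashMap<>(logRecords));
    }

    // Each call materializes its own result map instead of mutating an
    // instance member, so no synchronization is needed and concurrent
    // getRecordsByKeys() calls cannot observe each other's partial state.
    Map<String, String> getRecordsByKeys(List<String> keys) {
        return keys.stream()
                .filter(logRecords::containsKey)
                .collect(Collectors.toMap(k -> k, logRecords::get));
    }
}

public class StatelessReaderDemo {
    public static void main(String[] args) {
        StatelessLogRecordReader reader =
                new StatelessLogRecordReader(Map.of("key1", "v1", "key2", "v2"));
        System.out.println(reader.getRecordsByKeys(List.of("key1", "key3")));
        // prints {key1=v1}
    }
}
```

Contrast this with the temporary fix on master, which keeps the shared mutable state and serializes callers with `synchronized` instead.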


[jira] [Commented] (HUDI-1180) Upgrade HBase to 2.x

2022-02-14 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492362#comment-17492362
 ] 

Vinoth Chandar commented on HUDI-1180:
--

Apache Hudi can move to HBase 2.x, as long as the HFiles are backwards 
compatible, i.e. able to read older HFiles written using HBase 1.x. 

[~balajeeUber] [~pwason] et al. will deal with different Hudi versions running 
for hbase-client. We can collaborate on an adapter layer down the line if need 
be.

> Upgrade HBase to 2.x
> 
>
> Key: HUDI-1180
> URL: https://issues.apache.org/jira/browse/HUDI-1180
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Trying to upgrade HBase to 2.3.3 but ran into several issues.
> According to the Hadoop version support matrix: 
> [http://hbase.apache.org/book.html#hadoop], also need to upgrade Hadoop to 
> 2.8.5+.
>  
> There are several API conflicts between HBase 2.2.3 and HBase 1.2.3, we need 
> to resolve this first. After resolving conflicts, I am able to compile it but 
> then I ran into a tricky jetty version issue during the testing:
> {code:java}
> [ERROR] TestHBaseIndex.testDelete()  Time elapsed: 4.705 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdate()  Time elapsed: 0.174 
> s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdateWithRollback()  Time 
> elapsed: 0.076 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSmallBatchSize()  Time elapsed: 0.122 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTagLocationAndDuplicateUpdate()  Time elapsed: 
> 0.16 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalGetsBatching()  Time elapsed: 1.771 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalPutsBatching()  Time elapsed: 0.082 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> 34206 [Thread-260] WARN  
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner  - DirectoryScanner: 
> shutdown has been called
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager  - 
> IncrementalBlockReportManager interrupted
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.DataNode  - Ending block pool service 
> for: Block pool BP-1058834949-10.0.0.2-1597189606506 (Datanode Uuid 
> cb7bd8aa-5d79-4955-b1ec-bdaf7f1b6431) service to localhost/127.0.0.1:55924
> 34246 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data1/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 34247 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data2/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 37192 [HBase-Metrics2-1] WARN  org.apache.hadoop.metrics2.impl.MetricsConfig  
> - Cannot locate configuration: tried 
> hadoop-metrics2-datanode.properties,hadoop-metrics2.properties
> 43904 
> [master/iad1-ws-cor-r12:0:becomeActiveMaster-SendThread(localhost:58768)] 
> WARN  org.apache.zookeeper.ClientCnxn  - Session 0x173dfeb0c8b0004 for server 
> null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Errors: 
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.Sess

[GitHub] [hudi] nsivabalan commented on pull request #4385: [HUDI-1436]: provided option to trigger clean every nth commit

2022-02-14 Thread GitBox


nsivabalan commented on pull request #4385:
URL: https://github.com/apache/hudi/pull/4385#issuecomment-1039852714


   @pratyakshsharma : let me know once the feedback is addressed.






[GitHub] [hudi] nsivabalan commented on pull request #3646: [HUDI-349]: Added new cleaning policy based on number of hours

2022-02-14 Thread GitBox


nsivabalan commented on pull request #3646:
URL: https://github.com/apache/hudi/pull/3646#issuecomment-1039852572


   @pratyakshsharma : once you address the feedback, let me know. 






[hudi] branch master updated (27bd7b5 -> cb6ca7f)

2022-02-14 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 27bd7b5  [HUDI-1576] Make archiving an async service (#4795)
 add cb6ca7f  [HUDI-3204] fix problem that spark on TimestampKeyGenerator 
has no re… (#4714)

No new revisions were added by this update.

Summary of changes:
 .../keygen/TimestampBasedAvroKeyGenerator.java | 35 ++
 .../keygen/parser/BaseHoodieDateTimeParser.java|  6 +-
 .../hudi/keygen/parser/HoodieDateTimeParser.java   | 26 +++
 .../keygen/TestTimestampBasedKeyGenerator.java | 18 ++---
 .../hudi/common/table/HoodieTableConfig.java   | 13 
 .../hudi/common/table/HoodieTableMetaClient.java   | 35 ++
 .../hudi/keygen/constant/KeyGeneratorOptions.java  | 23 +++
 .../org/apache/hudi/table/HoodieTableFactory.java  | 13 ++--
 .../apache/hudi/table/TestHoodieTableFactory.java  |  7 +-
 .../scala/org/apache/hudi/HoodieFileIndex.scala| 57 +--
 .../org/apache/hudi/HoodieSparkSqlWriter.scala | 17 -
 .../apache/hudi/MergeOnReadSnapshotRelation.scala  |  6 +-
 .../org/apache/hudi/TestHoodieFileIndex.scala  |  8 ++-
 .../apache/hudi/functional/TestCOWDataSource.scala | 62 -
 .../hudi/functional/TestCOWDataSourceStorage.scala |  4 +-
 .../apache/hudi/functional/TestMORDataSource.scala | 80 ++
 16 files changed, 337 insertions(+), 73 deletions(-)


[GitHub] [hudi] nsivabalan merged pull request #4714: [HUDI-3204] fix problem that spark on TimestampKeyGenerator has no re…

2022-02-14 Thread GitBox


nsivabalan merged pull request #4714:
URL: https://github.com/apache/hudi/pull/4714


   






[jira] [Updated] (HUDI-2597) Improve code quality around Generics with Java 8

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2597:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7  (was: Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14)

> Improve code quality around Generics with Java 8
> 
>
> Key: HUDI-2597
> URL: https://issues.apache.org/jira/browse/HUDI-2597
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2439) Refactor table.action.commit package (CommitActionExecutors) in hudi-client module

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2439:
-
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-18)

> Refactor table.action.commit package (CommitActionExecutors) in hudi-client 
> module
> --
>
> Key: HUDI-2439
> URL: https://issues.apache.org/jira/browse/HUDI-2439
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3042) Refactor clustering action in hudi-client module to use HoodieData abstraction

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3042:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31)

> Refactor clustering action in hudi-client module to use HoodieData abstraction
> --
>
> Key: HUDI-3042
> URL: https://issues.apache.org/jira/browse/HUDI-3042
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: sev:high
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3334) Unable to merge HoodieMetadataPayload during partition listing

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3334:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Unable to merge HoodieMetadataPayload during partition listing
> --
>
> Key: HUDI-3334
> URL: https://issues.apache.org/jira/browse/HUDI-3334
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Blocker
>  Labels: HUDI-bug
> Fix For: 0.11.0
>
>
> When running the integration test with `mvn -Pintegration-tests verify`, the 
> test failed due to retrieving list of partition from metadata table.
> Stacktrace:
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieException: Error fetching 
> partition paths from metadata table
>     at 
> org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:299)
>     at 
> org.apache.hudi.HoodieTableFileIndexBase.getAllQueryPartitionPaths(HoodieTableFileIndexBase.scala:233)
>     at 
> org.apache.hudi.HoodieTableFileIndexBase.loadPartitionPathFiles(HoodieTableFileIndexBase.scala:195)
>     at 
> org.apache.hudi.HoodieTableFileIndexBase.refresh0(HoodieTableFileIndexBase.scala:108)
>     at 
> org.apache.hudi.HoodieTableFileIndexBase.(HoodieTableFileIndexBase.scala:88)
>     at 
> org.apache.hudi.hadoop.HiveHoodieTableFileIndex.(HiveHoodieTableFileIndex.java:52)
>     at 
> org.apache.hudi.hadoop.HoodieFileInputFormatBase.listStatusForSnapshotMode(HoodieFileInputFormatBase.java:170)
>     at 
> org.apache.hudi.hadoop.HoodieFileInputFormatBase.listStatus(HoodieFileInputFormatBase.java:141)
>     at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
>     at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
>     at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
>     at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:330)
>     at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
>     at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:198)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
>     at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>     at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
>     at 
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>     at 
> org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:411)
>     at 
> org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
>     at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
>     at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
>     at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
>     at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1232)
>     at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:255)
>     ... 11 more
> Caused by: org.apache.hudi.exception.HoodieMetadataException: Failed to 
> retrieve list of partition from metadata
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:100)
>     at 
> org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:297)
>     ... 47 more
> Caused by: org.apache.hudi.exception.HoodieException: Exception when reading 
> log file 
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:333)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:179)
>     at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:103)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.(HoodieMetadataMergedLogRecordReader.java:71)
>   

[jira] [Updated] (HUDI-2931) Add a config to turn off all table services

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2931:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Add a config to turn off all table services
> ---
>
> Key: HUDI-2931
> URL: https://issues.apache.org/jira/browse/HUDI-2931
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Rajesh Mahindra
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Have a single config file to turn off all table services -> apply it to use 
> cases, such as the Kafka Connect participant, that should not trigger any 
> table services





[jira] [Updated] (HUDI-3368) Support metadata bloom index for secondary keys

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3368:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Support metadata bloom index for secondary keys
> ---
>
> Key: HUDI-3368
> URL: https://issues.apache.org/jira/browse/HUDI-3368
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Support for metadata table based bloom index for secondary keys 





[jira] [Updated] (HUDI-2757) Support AWS Glue API for metastore sync

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2757:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Support AWS Glue API for metastore sync
> ---
>
> Key: HUDI-2757
> URL: https://issues.apache.org/jira/browse/HUDI-2757
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: hive
>Reporter: Raymond Xu
>Assignee: Rajesh Mahindra
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Raised in https://github.com/apache/hudi/issues/3954





[jira] [Updated] (HUDI-3430) Deltastreamer's Spark app hangs after DeltaSync is shut down

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3430:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Deltastreamer's Spark app hangs after DeltaSync is shut down
> --
>
> Key: HUDI-3430
> URL: https://issues.apache.org/jira/browse/HUDI-3430
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3381) Rebase `HoodieMergeHandle` to operate on `HoodieRecord`

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3381:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Rebase `HoodieMergeHandle` to operate on `HoodieRecord`
> ---
>
> Key: HUDI-3381
> URL: https://issues.apache.org/jira/browse/HUDI-3381
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> From RFC-46:
> `HoodieWriteHandle`s will be  
>    1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro 
> conversion)
>    2. Using Combining API engine to merge records (when necessary) 
>    3. Passes `HoodieRecord` as is to `FileWriter`





[jira] [Updated] (HUDI-3380) Rebase `HoodieDataBlock`s to operate on `HoodieRecord`

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3380:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Rebase `HoodieDataBlock`s to operate on `HoodieRecord`
> --
>
> Key: HUDI-3380
> URL: https://issues.apache.org/jira/browse/HUDI-3380
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> HoodieDataBlock implementations for Avro, Parquet, HFile have to be rebased 
> on HoodieRecord to unblock HUDI-3379





[jira] [Updated] (HUDI-1127) Handling late arriving Deletes

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1127:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Handling late arriving Deletes
> --
>
> Key: HUDI-1127
> URL: https://issues.apache.org/jira/browse/HUDI-1127
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: deltastreamer, writer-core
>Affects Versions: 0.9.0
>Reporter: Bhavani Sudha
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: sev:high
> Fix For: 0.11.0
>
>
> Recently I was working on a [PR|https://github.com/apache/hudi/pull/1704] to 
> enhance OverwriteWithLatestAvroPayload class to consider records in storage 
> when merging. Briefly, this class will ignore older updates if the record in 
> storage is the latest one ( based on the Precombine field). 
> Based on this, the expectation is that we handle any write operation that 
> should be dealt with the same way - if they are older they should be ignored. 
> While at this, I identified that we cannot handle all Deletes the same way. 
> This is because we process deletes in two ways mainly -
>  * by adding and enabling a metadata field  `_hoodie_is_deleted` to our in 
> the original record and sending it as an UPSERT operation.
>  * by using an empty payload using the EmptyHoodieRecordPayload and sending 
> the write as a DELETE operation. 
> While the former has ordering field and can be processed as expected (older 
> deletes will be ignored), the later does not have any ordering field to 
> identify if its an older delete or not and hence will let the older delete to 
> go through.
> Just opening this issue to track this gap. We would need to identify what is 
> the right choice here and fix as needed.



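The gap can be illustrated with a deliberately simplified merge rule in plain Java; the method and the null-as-missing-ordering convention are hypothetical, for illustration only.

```java
public class LateDeleteDemo {
    // Simplified merge rule: a delete sent as an upsert with the
    // _hoodie_is_deleted flag carries an ordering (precombine) value and
    // can be ignored when stale; an empty-payload delete carries none
    // (modeled here as null) and therefore always goes through.
    static boolean deleteApplies(long storedOrdering, Long incomingOrdering) {
        if (incomingOrdering == null) {
            return true; // no ordering field: staleness cannot be detected
        }
        return incomingOrdering >= storedOrdering;
    }

    public static void main(String[] args) {
        long stored = 100L; // ordering value of the record on storage
        System.out.println(deleteApplies(stored, 90L));  // false: stale delete ignored
        System.out.println(deleteApplies(stored, null)); // true: stale delete slips through
    }
}
```

The asymmetry in the two print statements is exactly the inconsistency the issue tracks.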


[jira] [Updated] (HUDI-3203) Meta bloom index should use the bloom filter type property to construct back the bloom filter instant

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3203:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Meta bloom index should use the bloom filter type property to construct back 
> the bloom filter instant
> -
>
> Key: HUDI-3203
> URL: https://issues.apache.org/jira/browse/HUDI-3203
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-1492) Enhance DeltaWriteStat with block level metadata correctly for storage schemes that support appends

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1492:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Enhance DeltaWriteStat with block level metadata correctly for storage 
> schemes that support appends
> ---
>
> Key: HUDI-1492
> URL: https://issues.apache.org/jira/browse/HUDI-1492
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Vinoth Chandar
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Current implementation simply uses the
> {code:java}
> String pathWithPartition = hoodieWriteStat.getPath(); {code}
> to write the metadata table. This is problematic if the delta write was 
> merely an append, and can technically add duplicate files into the metadata 
> table 
> (not sure if this is a problem per se, but filing a Jira to track and either 
> close/fix). 
>  



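A hedged sketch of what a block-aware write stat could record for append-capable storage; all names here are illustrative, not the actual Hudi schema.

```java
public class DeltaWriteStatDemo {
    // Illustrative block-aware stat: recording the appended block's offset
    // and length distinguishes "new file" from "append to existing file",
    // so the metadata table is not fed the same path twice as a new file.
    static String describe(String path, long blockOffset, long blockLength) {
        return blockOffset == 0
                ? "new file " + path + " (" + blockLength + " bytes)"
                : "append to " + path + " at offset " + blockOffset
                        + " (" + blockLength + " bytes)";
    }

    public static void main(String[] args) {
        System.out.println(describe("2022/02/14/.fg1.log.1", 0, 4096));
        System.out.println(describe("2022/02/14/.fg1.log.1", 4096, 1024));
    }
}
```

With the path alone (as in the snippet above), both calls would look identical to the metadata table.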


[jira] [Updated] (HUDI-3386) DROP INDEX command

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3386:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> DROP INDEX command
> -
>
> Key: HUDI-3386
> URL: https://issues.apache.org/jira/browse/HUDI-3386
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Drop or delete the index.





[jira] [Updated] (HUDI-3354) Rebase `HoodieRealtimeRecordReader` to return `HoodieRecord`

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3354:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Rebase `HoodieRealtimeRecordReader` to return `HoodieRecord`
> 
>
> Key: HUDI-3354
> URL: https://issues.apache.org/jira/browse/HUDI-3354
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> From RFC-46:
> `HoodieRealtimeRecordReader`s 
> 1. API will be returning opaque `HoodieRecord` instead of raw Avro payload





[jira] [Updated] (HUDI-3349) Revisit HoodieRecord API to be able to replace HoodieRecordPayload

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3349:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Revisit HoodieRecord API to be able to replace HoodieRecordPayload
> --
>
> Key: HUDI-3349
> URL: https://issues.apache.org/jira/browse/HUDI-3349
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> From RFC-46:
> To promote `HoodieRecord` to become a standardized API of interacting with a 
> single record, we need:
>  # Rebase usages of `HoodieRecordPayload` w/ `HoodieRecord`
>  # Implement new standardized record-level APIs (like `getPartitionKey` , 
> `getRecordKey`, etc) in `HoodieRecord`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3225) RFC for Async Metadata Index

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3225:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> RFC for Async Metadata Index
> 
>
> Key: HUDI-3225
> URL: https://issues.apache.org/jira/browse/HUDI-3225
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2965) Fix layout optimization to appropriately handle nested columns references

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2965:
-
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  
(was: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Fix layout optimization to appropriately handle nested columns references
> -
>
> Key: HUDI-2965
> URL: https://issues.apache.org/jira/browse/HUDI-2965
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, Layout Optimization only works when top-level columns are 
> specified as the columns to be ordered by.
>  
> We need to make sure it works correctly when a nested field reference is 
> specified in the configuration as well (like "a.b.c", referencing the field 
> `c` w/in the `b` sub-object of the top-level "a" column)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3284) Restore hudi-presto-bundle changes and upgrade presto version in docker setup

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3284:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Restore hudi-presto-bundle changes and upgrade presto version in docker setup
> -
>
> Key: HUDI-3284
> URL: https://issues.apache.org/jira/browse/HUDI-3284
> Project: Apache Hudi
>  Issue Type: Task
>  Components: trino-presto
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> For more details, https://github.com/apache/hudi/pull/4646



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3088:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Make Spark 3 the default profile for build and test
> ---
>
> Key: HUDI-3088
> URL: https://issues.apache.org/jira/browse/HUDI-3088
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> By default, when people check out the code, the Spark 3 profile should be 
> activated for the repo. Also, all tests should run against the latest supported 
> Spark version. Correspondingly, the default Scala version becomes 2.12 and the 
> default Parquet version 1.12.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2747) Fix hudi cli metadata commands

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2747:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Fix hudi cli metadata commands
> --
>
> Key: HUDI-2747
> URL: https://issues.apache.org/jira/browse/HUDI-2747
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: HUDI-bug
> Fix For: 0.11.0
>
>
> Fix hudi cli metadata commands



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3142) Metadata new Indices initialization during table creation

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3142:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Metadata new Indices initialization during table creation 
> --
>
> Key: HUDI-3142
> URL: https://issues.apache.org/jira/browse/HUDI-3142
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> The metadata table, when created for the first time, checks whether index 
> initialization is needed by comparing with the data table timeline. Today the 
> initialization only takes care of the metadata files partition. We need to do 
> similar initialization for all the new index partitions - bloom_filters, 
> col_stats.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3382) Support removal of bloom and column stats indexes

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3382:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Support removal of bloom and column stats indexes
> -
>
> Key: HUDI-3382
> URL: https://issues.apache.org/jira/browse/HUDI-3382
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2732) Spark Datasource V2 integration RFC

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2732:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Spark Datasource V2 integration RFC 
> 
>
> Key: HUDI-2732
> URL: https://issues.apache.org/jira/browse/HUDI-2732
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3356) Conversion of write stats to metadata index records should use HoodieData throughout

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3356:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Conversion of write stats to metadata index records should use HoodieData 
> throughout
> 
>
> Key: HUDI-3356
> URL: https://issues.apache.org/jira/browse/HUDI-3356
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> HoodieMetadataTableUtil convertMetadataToRecords() converts all write stats 
> to metadata index records as a List of HoodieRecords before passing them on to 
> the engine-specific commit() to prep records. This can OOM the driver. We need 
> to use HoodieData throughout. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2597) Improve code quality around Generics with Java 8

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2597:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Improve code quality around Generics with Java 8
> 
>
> Key: HUDI-2597
> URL: https://issues.apache.org/jira/browse/HUDI-2597
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3365) Make sure Metadata Records always bear full file-size

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3365:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Make sure Metadata Records always bear full file-size
> -
>
> Key: HUDI-3365
> URL: https://issues.apache.org/jira/browse/HUDI-3365
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: HUDI-bug, pull-request-available
> Fix For: 0.11.0
>
>
> Currently, when a log-file is appended, a Metadata Table record w/ the size of 
> the appended delta will be submitted to MT. 
> MT, in turn, will sum it up with whatever record is currently persisted there 
> at the moment (REF: 
> [https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java#L227)]
>  
> This is problematic in multiple ways:
>  # We're abusing the FileStatus interface, which unequivocally treats the size 
> as a full file-size.
>  # MT, receiving a new record, can't determine whether it received a delta or a 
> new record that has to override the old one. As such, it sticks to the 
> protocol that if a record already exists, it will treat the new one as a delta.
> This behavior is very implicit, and there is currently at least one bug where 
> the full file-size is actually provided, which leads to an incorrect file size 
> being stored in MT.
>  
> Proposal: Unify the data flow and always provide the full, up-to-date file-size 
> to the MT. Even in the log-file appending flow, we have a way to reconstruct 
> the full file-size (from `AppendResult`, w/o an additional `getFileStatus` 
> necessary).
>  
> Currently, when a log-file is appended (on an FS that supports it), only the 
> appended delta will be submitted w/in records to MT



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
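[Editor's note] The delta-vs-full-size ambiguity described in HUDI-3365 above can be sketched in plain Java (a hypothetical illustration, not Hudi's actual HoodieMetadataPayload merge code): if the merge protocol always treats the incoming size as a delta, any flow that submits a full file-size gets double-counted.

```java
// Hypothetical sketch of the merge ambiguity; names are illustrative only.
public class FileSizeMerge {
    // Current behavior per the ticket: incoming size is always added as a delta.
    static long mergeAsDelta(long persistedSize, long incomingSize) {
        return persistedSize + incomingSize;
    }

    public static void main(String[] args) {
        long persisted = 1000L;      // file-size currently stored in MT
        long appendedDelta = 200L;   // correct case: log append reports only the delta
        long fullSize = 1200L;       // buggy case: a flow reports the full file-size

        // Delta merge is correct when the incoming record really is a delta...
        System.out.println(mergeAsDelta(persisted, appendedDelta)); // 1200

        // ...but double-counts when the incoming record carries the full size.
        System.out.println(mergeAsDelta(persisted, fullSize));      // 2200
    }
}
```

The record itself cannot signal which case it is, which is why the ticket proposes always submitting the full, up-to-date file-size.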


[jira] [Updated] (HUDI-3258) Support multiple metadata index partitions - bloom and column stats

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3258:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Support multiple metadata index partitions - bloom and column stats
> ---
>
> Key: HUDI-3258
> URL: https://issues.apache.org/jira/browse/HUDI-3258
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: sev:normal
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3161) Add Call Produce Command for spark sql

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3161:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Add Call Produce Command for spark sql
> --
>
> Key: HUDI-3161
> URL: https://issues.apache.org/jira/browse/HUDI-3161
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> example
> {code:java}
> // code placeholder
> # Produce1
> call show_commits_metadata(table => 'test_hudi_table');
> commit_time  action  partition  file_id  previous_commit  num_writes  num_inserts  num_deletes  num_update_writes  total_errors  total_log_blocks  total_corrupt_logblocks  total_rollback_blocks  total_log_records  total_updated_records_compacted  total_bytes_written
> 20220109225319449  commit  dt=2021-05-03  d0073a12-085d-4f49-83e9-402947e7e90a-0  null  1  1  0  0  0  0  0  0  0  0  435349
> 20220109225311742  commit  dt=2021-05-02  b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0  20220109214830592  1  1  0  0  0  0  0  0  0  0  435340
> 20220109225301429  commit  dt=2021-05-01  0d7298b3-6b55-4cff-8d7d-b0772358b78a-0  20220109214830592  1  1  0  0  0  0  0  0  0  0  435340
> 20220109214830592  commit  dt=2021-05-01  0d7298b3-6b55-4cff-8d7d-b0772358b78a-0  20220109191631015  0  0  1  0  0  0  0  0  0  0  432653
> 20220109214830592  commit  dt=2021-05-02  b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0  20220109191648181  0  0  1  0  0  0  0  0  0  0  432653
> 20220109191648181  commit  dt=2021-05-02  b3b32bac-8a44-4c4d-b433-0cb1bf620f23-0  null  1  1  0  0  0  0  0  0  0  0  435341
> 20220109191631015  commit  dt=2021-05-01  0d7298b3-6b55-4cff-8d7d-b0772358b78a-0  null  1  1  0  0  0  0  0  0  0  0  435341
> Time taken: 0.844 seconds, Fetched 7 row(s)
> # Produce2
> call rollback_to_instant(table => 'test_hudi_table', instant_time => '20220109225319449');
> rollback_result
> true
> Time taken: 5.038 seconds, Fetched 1 row(s)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3166) Implement new HoodieIndex based on metadata indices

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3166:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Implement new HoodieIndex based on metadata indices 
> 
>
> Key: HUDI-3166
> URL: https://issues.apache.org/jira/browse/HUDI-3166
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, metadata
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: metadata, pull-request-available
> Fix For: 0.11.0
>
>
> A new HoodieIndex implementation working based off indices from the metadata 
> table. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3404) Disable metadata table by config with conditions

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3404:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Disable metadata table by config with conditions
> 
>
> Key: HUDI-3404
> URL: https://issues.apache.org/jira/browse/HUDI-3404
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Affects Versions: 0.11.0
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3397) Make sure Spark RDDs triggering actual FS activity are only dereferenced once

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3397:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Make sure Spark RDDs triggering actual FS activity are only dereferenced once
> -
>
> Key: HUDI-3397
> URL: https://issues.apache.org/jira/browse/HUDI-3397
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: spark
> Fix For: 0.11.0
>
>
> Currently, the RDD `collect()` operation is treated quite loosely, and there are 
> multiple flows that dereference RDDs (for ex, through `collect`, `count`, etc.), 
> thereby triggering the same operations multiple times and occasionally 
> duplicating the output already persisted on FS.
> Check out HUDI-3370 for a recent example.
> NOTE: Even though Spark caching is supposed to make sure that we aren't 
> writing to FS multiple times, we can't solely rely on caching to guarantee 
> exactly-once execution.
> Instead, we should make sure that RDDs are only dereferenced {*}once{*}, w/in 
> the "commit" operation, and that all the other operations only rely on 
> _derivative_ data.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
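[Editor's note] The "dereference once" discipline proposed in HUDI-3397 above can be illustrated with a plain-Java analogy (not Spark code): a lazy computation with a side effect re-executes on every dereference unless its result is materialized once and passed around as derivative data.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class DereferenceOnce {
    static final AtomicInteger writes = new AtomicInteger();

    // Simulates a lazy RDD lineage: every dereference re-runs the side-effecting write.
    static Supplier<Integer> lazyWrite() {
        return () -> { writes.incrementAndGet(); return 42; };
    }

    public static void main(String[] args) {
        Supplier<Integer> rdd = lazyWrite();
        rdd.get();                                    // "collect()" in one flow
        rdd.get();                                    // "count()" in another flow re-triggers the write
        System.out.println("writes=" + writes.get()); // prints writes=2

        // Dereference once; subsequent consumers use the materialized result.
        writes.set(0);
        int result = lazyWrite().get();               // single dereference w/in "commit"
        int a = result, b = result;                   // derivative data, no re-execution
        System.out.println("writes=" + writes.get() + " sum=" + (a + b)); // prints writes=1 sum=84
    }
}
```

Spark's `cache()` can mask the duplicate execution, but as the ticket notes, caching alone does not guarantee exactly-once behavior if the cached partitions are evicted.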


[jira] [Updated] (HUDI-3394) Fail to release in-process lock due to IllegalMonitorStateException

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3394:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Fail to release in-process lock due to IllegalMonitorStateException
> ---
>
> Key: HUDI-3394
> URL: https://issues.apache.org/jira/browse/HUDI-3394
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: HUDI-bug
> Fix For: 0.11.0
>
>
> Environment: Deltastreamer continuous mode writing MOR table with upserts, 
> with async Compaction, and Cleaner, archival and metadata table enabled.  
> InProcessLockProvider is used.
> {code:java}
> hoodie.write.concurrency.mode=optimistic_concurrency_control
> hoodie.cleaner.policy.failed.writes=LAZY
> hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
>  {code}
> Root cause: one thread writing a deltacommit is holding the lock.  The async 
> cleaner fails to grab the lock within a minute or so due to the ongoing 
> deltacommit, and then tries to unlock, which throws 
> IllegalMonitorStateException.
> Full logs and stacktrace:
>  
> {code:java}
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRollbackException: Failed to rollback 
> /Users/ethan/Work/data/hudi/metadata_test_ds_mor_continuous_4 commits 
> 20220207234228129
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>   at 
> org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:89)
>   at 
> org.apache.hudi.client.AsyncCleanerService.waitForCompletion(AsyncCleanerService.java:71)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.autoCleanOnCommit(BaseHoodieWriteClient.java:523)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.postCommit(BaseHoodieWriteClient.java:462)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:200)
>   at 
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:127)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:578)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:323)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:643)
>   at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieRollbackException: Failed to 
> rollback /Users/ethan/Work/data/hudi/metadata_test_ds_mor_continuous_4 
> commits 20220207234228129
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.rollback(BaseHoodieWriteClient.java:666)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.rollbackFailedWrites(BaseHoodieWriteClient.java:971)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.rollbackFailedWrites(BaseHoodieWriteClient.java:954)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.lambda$clean$33796fd2$1(BaseHoodieWriteClient.java:736)
>   at 
> org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:135)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:735)
>   at 
> org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:703)
>   at 
> org.apache.hudi.client.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:51)
>   ... 4 more
> Caused by: org.apache.hudi.exception.HoodieLockException: Thread 
> pool-26-thread-1 FAILED_TO_RELEASE in-process lock.
>   at 
> org.apache.hudi.client.transaction.lock.InProcessLockProvider.unlock(InProcessLockProvider.java:97)
>   at 
> org.apache.hudi.client.transaction.lock.LockManager.unlock(LockManager.java:88)
>   at 
> org.apache.hudi.client.transaction.TransactionManager.endTransaction(TransactionManager.java:80)
>   at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:252)
>   at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:122)
>   at 
> org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:144)
>
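[Editor's note] The failure mode in HUDI-3394 above can be reproduced outside Hudi with a minimal sketch: calling `unlock()` on a `java.util.concurrent.locks.ReentrantLock` from a thread that does not hold it throws `IllegalMonitorStateException`, which is what the cleaner thread hits after its lock acquisition times out.

```java
import java.util.concurrent.locks.ReentrantLock;

public class UnlockWithoutHold {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();

        // "Writer" thread acquires the lock and never releases it
        // (analogous to the thread holding the lock during a deltacommit).
        Thread writer = new Thread(lock::lock);
        writer.start();
        writer.join();

        try {
            // "Cleaner" (main thread) never acquired the lock, yet tries to release it.
            lock.unlock();
        } catch (IllegalMonitorStateException e) {
            System.out.println("FAILED_TO_RELEASE: " + e.getClass().getSimpleName());
        }
    }
}
```

This suggests the lock provider should only attempt `unlock()` when the current thread actually holds the lock (e.g., guarded by `isHeldByCurrentThread()`).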

[jira] [Updated] (HUDI-3396) Make sure Spark reads only Projected Columns for both MOR/COW

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3396:
-
Sprint: Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Feb-7)

> Make sure Spark reads only Projected Columns for both MOR/COW
> -
>
> Key: HUDI-3396
> URL: https://issues.apache.org/jira/browse/HUDI-3396
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available, spark
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2022-02-08 at 4.58.12 PM.png
>
>
> The Spark Relation impl for MOR tables seems to have the following issues:
>  * `requiredSchemaParquetReader` still leverages the full table schema, entailing 
> that we're fetching *all* columns from Parquet (even though the query might 
> just be projecting a handful) 
>  * `fullSchemaParquetReader` is always reading the full table to (presumably) be 
> able to do merging, which might access arbitrary key-fields. This seems 
> superfluous, since we need only fetch the fields designated as 
> `PRECOMBINE_FIELD_NAME` and `RECORDKEY_FIELD_NAME`. We won't be able 
> to do that if either of the following is true:
>  ** Virtual Keys are used (key-gen will require the whole payload)
>  ** A non-trivial merging strategy is used, requiring the whole record payload
>  * We don't seem to properly push down data filters to the Parquet reader when 
> reading the whole table
>  
> AIs
>  * Make sure COW tables _only_ read projected columns
>  * Make sure MOR tables _only_ read projected columns, except when either of
>  ** A non-standard Record Payload class is used (for merging) 
>  ** Virtual keys are used
>  * +Write tests for Spark DataSource asserting that only projected columns 
> are being fetched+
>  
> !Screen Shot 2022-02-08 at 4.58.12 PM.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3221) Support querying a table as of a savepoint

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3221:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Support querying a table as of a savepoint
> --
>
> Key: HUDI-3221
> URL: https://issues.apache.org/jira/browse/HUDI-3221
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: hive, reader-core, spark, writer-core
>Reporter: Ethan Guo
>Assignee: Forward Xu
>Priority: Blocker
>  Labels: pull-request-available, user-support-issues
> Fix For: 0.11.0
>
>
> Right now, point-in-time queries are limited to what's retained by the 
> cleaner. If we fix this and expose it via SQL, then it's a gap we close.
> The Dataframe read path supports this option, but the SQL read path does not:
> [https://hudi.apache.org/docs/quick-start-guide/#time-travel-query]
> SparkSQL Syntax
> {code:java}
> // code placeholder
> SELECT * FROM A.B TIMESTAMP AS OF 1643119574;
> SELECT * FROM A.B TIMESTAMP AS OF '2019-01-29 00:37:58';
> SELECT * FROM A.B VERSION AS OF 'Snapshot123456789';{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3246) Blog on Kafka Connect Sink for Hudi

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3246:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Blog on Kafka Connect Sink for Hudi
> ---
>
> Key: HUDI-3246
> URL: https://issues.apache.org/jira/browse/HUDI-3246
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs, kafka-connect
>Reporter: Ethan Guo
>Assignee: Rajesh Mahindra
>Priority: Blocker
>  Labels: kafka-connect
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2930) Rollbacks are not archived when metadata table is enabled

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2930:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Rollbacks are not archived when metadata table is enabled
> -
>
> Key: HUDI-2930
> URL: https://issues.apache.org/jira/browse/HUDI-2930
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: HUDI-bug
> Fix For: 0.11.0
>
>
> I ran bulk inserts into a COW table using DeltaStreamer continuous mode and 
> observed that the rollbacks are not archived.  There were commits in between 
> these old rollbacks, but after the archival process kicked in, the old 
> rollbacks were still in the active timeline while the other commits were 
> archived.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2925) Cleaner may attempt to delete the same file twice when metadata table is enabled

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2925:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Cleaner may attempt to delete the same file twice when metadata table is 
> enabled
> 
>
> Key: HUDI-2925
> URL: https://issues.apache.org/jira/browse/HUDI-2925
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: HUDI-bug, core-flow-ds, pull-request-available, sev:high
> Fix For: 0.11.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> This issue happens only when the TimelineServer is disabled (reason in the next 
> comment). Our pipelines execute a write (insert or upsert) along with an 
> asynchronous clean. The metadata table is enabled.
>  
> Assume the timelines are as follows:
> Dataset:   100.commit        101.commit   102.clean.inflight
> Metadata: 100.deltacommit
> (this happened as the pipeline failed due to non-HUDI issues while executing 
> 101 and 102)
>  
> In the next run of the pipeline some more data is available, so a commit will 
> take place (103.commit.requested). Along with it, an asynchronous clean 
> starts (104.clean.requested). The [BaseCleanActionExecutor detects the 
> previously unfinished 
> clean|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java#L231]
>  (102.clean.inflight) and attempts to do it first. So the order of cleans 
> will be 102.clean followed by 104.clean.
>  
> 102.clean => Suppose this deletes files from 90.commit
> 104.clean => This should delete files from 91.commit
>  
> The issue is that while executing 104.clean, the filesystem view is still the 
> one which was used during 102.clean (i.e., post-clean the file system view is 
> not synced). When the metadata table is enabled, HoodieMetadataFileSystemView is 
> used, which has the metadata reader inside it. This metadata reader opens the 
> metadata table at a particular time instant (101.commit, as that was 
> the last completed action). Even after 102.clean is completed, the 
> HoodieMetadataFileSystemView is still using the cached metadata reader. 
> Hence, the reader still returns files from 90.commit which have already been 
> deleted by 102.clean.  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3207) Hudi Trino connector PR review

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3207:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Hudi Trino connector PR review
> --
>
> Key: HUDI-3207
> URL: https://issues.apache.org/jira/browse/HUDI-3207
> Project: Apache Hudi
>  Issue Type: Task
>  Components: trino-presto
>Reporter: Ethan Guo
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> https://github.com/trinodb/trino/pull/10228





[jira] [Updated] (HUDI-2973) Rewrite/re-publish RFC for Data skipping index

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2973:
-
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Rewrite/re-publish RFC for Data skipping index
> --
>
> Key: HUDI-2973
> URL: https://issues.apache.org/jira/browse/HUDI-2973
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-2809) Introduce a checksum mechanism for validating hoodie.properties

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2809:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Introduce a checksum mechanism for validating hoodie.properties
> ---
>
> Key: HUDI-2809
> URL: https://issues.apache.org/jira/browse/HUDI-2809
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: writer-core
>Reporter: Vinoth Chandar
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> The idea here is to add a
> {_}hoodie.checksum=
> entry as the last value of hoodie.properties and throw an error if it does
> not validate. This is to guard against partial writes on HDFS.
>  
> The main implementation issue is that Properties is a hashtable, so the
> entry is not necessarily added as the last value.
>  
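A minimal sketch of the mechanism described above, assuming a sorted serialization order and CRC32 (the key name `hoodie.properties.checksum` and the checksum choice are illustrative assumptions, not Hudi's actual implementation): writing keys in a deterministic order sidesteps the Properties hashtable-ordering issue, and validation detects a truncated file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class PropsChecksumSketch {
    static final String CHECKSUM_KEY = "hoodie.properties.checksum"; // illustrative name

    static long checksumOf(String body) {
        CRC32 crc = new CRC32();
        crc.update(body.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Serialize with sorted keys (instead of Properties' hashtable order)
    // so the checksum line can always be written last.
    static String serialize(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(props).entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb + CHECKSUM_KEY + "=" + checksumOf(sb.toString()) + "\n";
    }

    // Returns false for a truncated or corrupted file.
    static boolean validate(String fileContent) {
        int idx = fileContent.lastIndexOf(CHECKSUM_KEY + "=");
        if (idx < 0) return false;
        String body = fileContent.substring(0, idx);
        String stored = fileContent.substring(idx + CHECKSUM_KEY.length() + 1).trim();
        return String.valueOf(checksumOf(body)).equals(stored);
    }

    public static void main(String[] args) {
        Map<String, String> props = new TreeMap<>();
        props.put("hoodie.table.name", "t1");
        props.put("hoodie.table.type", "MERGE_ON_READ");
        String content = serialize(props);

        boolean ok = validate(content);
        // Simulate a partial write by dropping the last few bytes.
        boolean truncatedOk = validate(content.substring(0, content.length() - 4));
        System.out.println(ok + " " + truncatedOk);
    }
}
```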





[jira] [Updated] (HUDI-3366) Remove unnecessary hardcoded logic of disabling metadata table in tests

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3366:
-
Sprint: Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Remove unnecessary hardcoded logic of disabling metadata table in tests
> ---
>
> Key: HUDI-3366
> URL: https://issues.apache.org/jira/browse/HUDI-3366
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3074) Docs for Z-order

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3074:
-
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  
(was: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Docs for Z-order
> 
>
> Key: HUDI-3074
> URL: https://issues.apache.org/jira/browse/HUDI-3074
> Project: Apache Hudi
>  Issue Type: Task
>  Components: clustering, docs
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3175) Support INDEX action for async metadata indexing

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3175:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Support INDEX action for async metadata indexing
> 
>
> Key: HUDI-3175
> URL: https://issues.apache.org/jira/browse/HUDI-3175
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, metadata
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: metadata, pull-request-available
> Fix For: 0.11.0
>
>
> Add a new WriteOperationType and handle conflicts with concurrent writer or 
> any other async table service. Implement the protocol in HUDI-2488





[jira] [Updated] (HUDI-1180) Upgrade HBase to 2.x

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1180:
-
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Upgrade HBase to 2.x
> 
>
> Key: HUDI-1180
> URL: https://issues.apache.org/jira/browse/HUDI-1180
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Affects Versions: 0.9.0
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Trying to upgrade HBase to 2.3.3 but ran into several issues.
> According to the Hadoop version support matrix 
> ([http://hbase.apache.org/book.html#hadoop]), we also need to upgrade Hadoop
> to 2.8.5+.
>  
> There are several API conflicts between HBase 2.2.3 and HBase 1.2.3, which we
> need to resolve first. After resolving the conflicts, I was able to compile
> it, but then ran into a tricky Jetty version issue during testing:
> {code:java}
> [ERROR] TestHBaseIndex.testDelete()  Time elapsed: 4.705 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdate()  Time elapsed: 0.174 
> s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdateWithRollback()  Time 
> elapsed: 0.076 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testSmallBatchSize()  Time elapsed: 0.122 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTagLocationAndDuplicateUpdate()  Time elapsed: 
> 0.16 s  <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalGetsBatching()  Time elapsed: 1.771 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR] TestHBaseIndex.testTotalPutsBatching()  Time elapsed: 0.082 s  <<< 
> ERROR!
> java.lang.NoSuchMethodError: 
> org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> 34206 [Thread-260] WARN  
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner  - DirectoryScanner: 
> shutdown has been called
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager  - 
> IncrementalBlockReportManager interrupted
> 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to 
> localhost/127.0.0.1:55924] WARN  
> org.apache.hadoop.hdfs.server.datanode.DataNode  - Ending block pool service 
> for: Block pool BP-1058834949-10.0.0.2-1597189606506 (Datanode Uuid 
> cb7bd8aa-5d79-4955-b1ec-bdaf7f1b6431) service to localhost/127.0.0.1:55924
> 34246 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data1/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 34247 
> [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data2/current/BP-1058834949-10.0.0.2-1597189606506]
>  WARN  org.apache.hadoop.fs.CachingGetSpaceUsed  - Thread Interrupted waiting 
> to refresh disk information: sleep interrupted
> 37192 [HBase-Metrics2-1] WARN  org.apache.hadoop.metrics2.impl.MetricsConfig  
> - Cannot locate configuration: tried 
> hadoop-metrics2-datanode.properties,hadoop-metrics2.properties
> 43904 
> [master/iad1-ws-cor-r12:0:becomeActiveMaster-SendThread(localhost:58768)] 
> WARN  org.apache.zookeeper.ClientCnxn  - Session 0x173dfeb0c8b0004 for server 
> null, unexpected error, closing socket connection and attempting reconnect
> java.net.ConnectException: Connection refused
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Errors: 
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
> [ERROR]   org.eclipse.jetty.server.session.SessionHandler.s

[jira] [Updated] (HUDI-1296) Implement Spark DataSource using range metadata for file/partition pruning

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1296:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Implement Spark DataSource using range metadata for file/partition pruning
> --
>
> Key: HUDI-1296
> URL: https://issues.apache.org/jira/browse/HUDI-1296
> Project: Apache Hudi
>  Issue Type: Task
>  Components: spark
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> MT exposed as a Spark DataSource should provide the following interface:
> {code:java}
> Columnar interface to MT:
>  - Filename
>- Already available in the payload
>  - Partition Path (?)
> - We can decode on the fly (we have list of partitions, so we 
> can match it with the key)
>  - Col A stats
>  - Col B stats
>  - ... {code}





[jira] [Updated] (HUDI-2695) Documentation

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2695:
-
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  
(was: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Documentation
> -
>
> Key: HUDI-2695
> URL: https://issues.apache.org/jira/browse/HUDI-2695
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Kyle Weller
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3280) Clean up unused/deprecated methods

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3280:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Clean up unused/deprecated methods
> --
>
> Key: HUDI-3280
> URL: https://issues.apache.org/jira/browse/HUDI-3280
> Project: Apache Hudi
>  Issue Type: Task
>  Components: reader-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Clean up unused/deprecated methods as well as additional validations in 
>  * HoodieInputFormatUtils





[jira] [Updated] (HUDI-2751) To avoid the duplicates for streaming read MOR table

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2751:
-
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> To avoid the duplicates for streaming read MOR table
> 
>
> Key: HUDI-2751
> URL: https://issues.apache.org/jira/browse/HUDI-2751
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Reporter: Danny Chen
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Imagine there are commits on the timeline:
> {code:java}
>             inflight compaction             complete compaction
>                     |                               |
> -- instant 99 -- instant 100 -- 101 -- 102 -- instant 100 --
>  first read ->|                second read ->|
> |-- range 1 --|------ range 2 ------|
> {code}
> Instants 99, 101, and 102 are successful non-compaction delta commits;
> instant 100 is a compaction instant.
> The first incremental read consumes up to instant 99 and the second read
> consumes from instant 100 to instant 102, so the second read would consume
> the commit files of instant 100, which have already been consumed before.
> The duplicate reading happens when this condition triggers: a compaction
> instant is scheduled and then completes within *one* consume range.
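One way to sketch a fix for that condition in Java (the `Instant` type, instant times, and method names below are illustrative, not Hudi's timeline API): skip compaction instants when selecting instants for an incremental consume range, since they only rewrite previously consumed delta commits.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrReadSketch {
    static class Instant {
        final String time;
        final boolean compaction;
        Instant(String time, boolean compaction) {
            this.time = time;
            this.compaction = compaction;
        }
    }

    // Select instants in (startExclusive, endInclusive], excluding compaction
    // instants whose data was already consumed as delta commits.
    static List<String> instantsToConsume(List<Instant> timeline,
                                          String startExclusive,
                                          String endInclusive) {
        List<String> result = new ArrayList<>();
        for (Instant i : timeline) {
            boolean inRange = i.time.compareTo(startExclusive) > 0
                    && i.time.compareTo(endInclusive) <= 0;
            if (inRange && !i.compaction) result.add(i.time);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Instant> timeline = new ArrayList<>();
        timeline.add(new Instant("099", false));
        timeline.add(new Instant("100", true)); // compaction completes mid-range
        timeline.add(new Instant("101", false));
        timeline.add(new Instant("102", false));

        // The second incremental read consumes (099, 102];
        // naive logic would replay instant 100's commit files.
        System.out.println(instantsToConsume(timeline, "099", "102"));
    }
}
```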





[jira] [Updated] (HUDI-3177) CREATE INDEX command

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3177:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> CREATE INDEX command
> 
>
> Key: HUDI-3177
> URL: https://issues.apache.org/jira/browse/HUDI-3177
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index, metadata
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Users should be able to trigger index creation using a CREATE INDEX
> statement or a CLI tool, capturing the options below for one or more
> partitions.
>  
> {code:java}
> CREATE [BLOOM | COL_STATS | SOME_INDEX_TYPE] INDEX ON TABLE  [table_name] FOR 
> COLUMNS (col1, col2, col3) WITH OPTION  (, 
> );{code}
>  
> Maps to following hudi configs:
> {code:java}
> METADATA_PREFIX + ".index.bloom.filter.file.group.count” 
> METADATA_PREFIX + ".index.column.stats.file.group.count" 
> METADATA_PREFIX + ".index.bloom.filter.for.columns” -> comma-separated column 
> names 
> METADATA_PREFIX + ".index.column.stats.for.columns" -> comma-separated column 
> names{code}
> Even the CLI indexer tool will map user inputs to the above configs.
> By default, the bloom filter will only cover the record key, and column
> stats will cover all columns.
> For v0.11.0, our assumption is:
>  # Static file group count for all columns.
>  # Infer the set of columns that have already been indexed from the MT 
> partition layout (see HUDI-3258).





[jira] [Updated] (HUDI-1370) Scoping work needed to support bootstrapped data table and RFC-15 together

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1370:
-
Sprint: Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Scoping work needed to support bootstrapped data table and RFC-15 together
> --
>
> Key: HUDI-1370
> URL: https://issues.apache.org/jira/browse/HUDI-1370
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3075) Docs for Debezium source

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3075:
-
Sprint: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  
(was: Hudi-Sprint-Jan-3, Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, 
Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7)

> Docs for Debezium source
> 
>
> Key: HUDI-3075
> URL: https://issues.apache.org/jira/browse/HUDI-3075
> Project: Apache Hudi
>  Issue Type: Task
>  Components: deltastreamer, docs
>Reporter: Kyle Weller
>Assignee: Kyle Weller
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3341) Investigate that metadata table cannot be read for hadoop-aws 2.7.x

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3341:
-
Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Investigate that metadata table cannot be read for hadoop-aws 2.7.x
> ---
>
> Key: HUDI-3341
> URL: https://issues.apache.org/jira/browse/HUDI-3341
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: metadata
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: HUDI-bug
> Fix For: 0.11.0
>
>
> Environment: spark 2.4.4 + aws-java-sdk-1.7.4 + hadoop-aws-2.7.4, Hudi 
> 0.11.0-SNAPSHOT, metadata table enabled
> On the write path, the ingestion is successful with metadata table updated.  
> When trying to read the metadata table for listing, e.g., using hudi-cli, the 
> operation fails with the following exception.
> {code:java}
> Failed to retrieve list of partition from metadata
> org.apache.hudi.exception.HoodieMetadataException: Failed to retrieve list of 
> partition from metadata
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:110)
>     at 
> org.apache.hudi.cli.commands.MetadataCommand.listPartitions(MetadataCommand.java:208)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216)
>     at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68)
>     at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59)
>     at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134)
>     at 
> org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533)
>     at org.springframework.shell.core.JLineShell.run(JLineShell.java:179)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: Exception when reading 
> log file 
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:334)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:179)
>     at 
> org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:103)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.(HoodieMetadataMergedLogRecordReader.java:71)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.(HoodieMetadataMergedLogRecordReader.java:51)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader$Builder.build(HoodieMetadataMergedLogRecordReader.java:246)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getLogRecordScanner(HoodieBackedTableMetadata.java:376)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$openReadersIfNeeded$4(HoodieBackedTableMetadata.java:292)
>     at 
> java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.openReadersIfNeeded(HoodieBackedTableMetadata.java:282)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$0(HoodieBackedTableMetadata.java:138)
>     at java.util.HashMap.forEach(HashMap.java:1289)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:137)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordByKey(HoodieBackedTableMetadata.java:127)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.fetchAllPartitionPaths(BaseTableMetadata.java:275)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getAllPartitionPaths(BaseTableMetadata.java:108)
>     ... 12 more
> Caused by: org.apache.hudi.exception.HoodieIOException: IOException when 
> reading logblock from log file 
> HoodieLogFile{pathStr='s3a://hudi-testing/metadata_test_table_2/.hoodie/metadata/files/.files-_00.log.1_0-0-0',
>  fileLen=-1}
>     at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:375)
>     at 
> org.apache.hudi.common.table.log.HoodieLogFormatReader.next(HoodieLogFormatReader.java:120)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:211)
>     .

[jira] [Updated] (HUDI-3208) Come up with rollout plan for enabling metadata table by default in 0.11

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-3208:
-
Sprint: Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, 
Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14  (was: 
Hudi-Sprint-Jan-10, Hudi-Sprint-Jan-18, Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Come up with rollout plan for enabling metadata table by default in 0.11
> 
>
> Key: HUDI-3208
> URL: https://issues.apache.org/jira/browse/HUDI-3208
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata, writer-core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Code-level 
>  * We should throw errors if lock provider is not configured
>  * At no point should we lead unwitting users to corrupt their tables
> Docs, get community feedback on any proposal. 
> Then we start testing this more. 
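A hedged sketch of the code-level guard mentioned above (the config key names here are made up for illustration, not Hudi's actual keys): fail fast when multi-writer concurrency is configured without a lock provider.

```java
import java.util.HashMap;
import java.util.Map;

public class LockGuardSketch {
    // Throws when the metadata table is enabled under multi-writer
    // concurrency but no lock provider is configured.
    static void validate(Map<String, String> cfg) {
        boolean metadataEnabled = Boolean.parseBoolean(
                cfg.getOrDefault("metadata.enable", "true"));
        boolean multiWriter = "OPTIMISTIC_CONCURRENCY_CONTROL"
                .equals(cfg.get("write.concurrency.mode"));
        if (metadataEnabled && multiWriter && !cfg.containsKey("lock.provider")) {
            throw new IllegalStateException(
                    "Lock provider must be configured when metadata table is enabled");
        }
    }

    public static void main(String[] args) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put("write.concurrency.mode", "OPTIMISTIC_CONCURRENCY_CONTROL");

        boolean threw = false;
        try {
            validate(cfg); // no lock provider configured -> should throw
        } catch (IllegalStateException e) {
            threw = true;
        }

        cfg.put("lock.provider", "SomeLockProvider");
        validate(cfg); // passes once a lock provider is configured

        System.out.println(threw);
    }
}
```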





[jira] [Updated] (HUDI-512) Support for Index functions on columns to generate logical or micro partitioning

2022-02-14 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-512:

Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, 
Hudi-Sprint-Feb-14  (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, 
Hudi-Sprint-Feb-7)

> Support for Index functions on columns to generate logical or micro 
> partitioning
> 
>
> Key: HUDI-512
> URL: https://issues.apache.org/jira/browse/HUDI-512
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Alexander Filipchik
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: features
> Fix For: 0.11.0
>
>
> This one is more aspirational but, I believe, will be very useful.
> Currently Hudi follows the Hive table format, which means that data is
> logically and physically partitioned into a folder structure like:
> table_name
>   2019
>     01
>     02
>        bla.parquet
>  
> This has several issues:
>  1) Modern object stores (AWS S3, GCP) are more performant when each file
> name starts with some kind of random value; by that measure the Hive layout
> is not ideal.
> 2) The Hive Metastore stores partitions in a text field in a single table (2
> tables with very similar information) and doesn't support proper filtering.
> Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are supported (at least in Hive 2.X.X)
> 3) Having a single point of failure that relies on a non-distributed DB is
> dangerous and creates bottlenecks.
>  
> The idea is to get rid of logical partitioning altogether (and the Hive
> metastore as well). If a dataset has a time column, users should be able to
> query it without understanding the physical layout of the table (by
> specifying those partitions explicitly or ending up with a full table scan
> accidentally).
> It will require some kind of mapping of time to file locations (similar to
> Iceberg). I'm also leaning towards the idea that storing table metadata with
> the table is a good thing, as it can be read by the engine in one shot and
> will be faster than taxing a standalone metastore.





[jira] [Closed] (HUDI-1576) Add ability to perform archival synchronously

2022-02-14 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-1576.

Resolution: Done

> Add ability to perform archival synchronously
> -
>
> Key: HUDI-1576
> URL: https://issues.apache.org/jira/browse/HUDI-1576
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: archiving
>Reporter: Nishith Agarwal
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Currently, archival runs inline. We want to move archival to a table service 
> like cleaning, compaction etc..
> and treat it like that. of course, no new action will be introduced. 
>  





[jira] [Closed] (HUDI-2370) Supports data encryption (COW tables)

2022-02-14 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-2370.

Resolution: Done

> Supports data encryption (COW tables)
> -
>
> Key: HUDI-2370
> URL: https://issues.apache.org/jira/browse/HUDI-2370
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Data security is becoming more and more important; it would be very welcome
> if Hudi could support encryption.
> 1. Specify column encryption
>  2. Support footer encryption
>  3. Custom encryption client interface (provide a memory-based encryption
> client by default)
> 4. Specify the encryption key
>  
> When querying, you need to pass the relevant key or obtain query permission
> based on the client's encryption interface. If it fails, the result cannot
> be returned.
>  1. When querying non-encrypted fields, the key is not passed, and the data 
> is returned normally
>  2. When querying encrypted fields, the key is not passed and the data is not 
> returned
>  3. When the encrypted field is queried, the key is passed, and the data is 
> returned normally
>  4. When querying all fields, the key is not passed and no result is 
> returned. If passed, the data returns normally
>  
> Start with COW first





[jira] [Resolved] (HUDI-3254) Introduce HoodieCatalog to manage tables for Spark Datasource V2

2022-02-14 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu resolved HUDI-3254.
--

> Introduce HoodieCatalog to manage tables for Spark Datasource V2
> 
>
> Key: HUDI-3254
> URL: https://issues.apache.org/jira/browse/HUDI-3254
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
>  Labels: pull-request-available, sev:normal
> Fix For: 0.11.0
>
>






[jira] [Closed] (HUDI-3398) Schema validation fails for metadata table base file

2022-02-14 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo closed HUDI-3398.
---
Resolution: Fixed

> Schema validation fails for metadata table base file
> 
>
> Key: HUDI-3398
> URL: https://issues.apache.org/jira/browse/HUDI-3398
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Yue Zhang
>Priority: Blocker
>  Labels: HUDI-bug, pull-request-available
> Fix For: 0.11.0
>
>
> Stacktrace:
> {code:java}
> java.lang.IllegalArgumentException: Unknown file format :file:/Users/ethan/Work/data/hudi/metadata_test_ds_mor_continuous_4/.hoodie/metadata/files/files-_0-93-815_20220208164926830001.hfile
>   at org.apache.hudi.common.table.TableSchemaResolver.getTableParquetSchemaFromDataFile(TableSchemaResolver.java:103)
>   at org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaFromDataFile(TableSchemaResolver.java:119)
>   at org.apache.hudi.common.table.TableSchemaResolver.hasOperationField(TableSchemaResolver.java:480)
>   at org.apache.hudi.common.table.TableSchemaResolver.<init>(TableSchemaResolver.java:65)
>   at org.apache.hudi.table.HoodieTable.validateSchema(HoodieTable.java:682)
>   at org.apache.hudi.table.HoodieTable.validateUpsertSchema(HoodieTable.java:698)
>   at org.apache.hudi.client.SparkRDDWriteClient.upsertPreppedRecords(SparkRDDWriteClient.java:173)
>   at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.commit(SparkHoodieBackedTableMetadataWriter.java:154)
>   at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.processAndCommit(HoodieBackedTableMetadataWriter.java:663)
>   at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:675)
>   at org.apache.hudi.client.BaseHoodieWriteClient.lambda$writeTableMetadata$0(BaseHoodieWriteClient.java:273)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at org.apache.hudi.client.BaseHoodieWriteClient.writeTableMetadata(BaseHoodieWriteClient.java:273)
>   at org.apache.hudi.client.BaseHoodieWriteClient.commit(BaseHoodieWriteClient.java:229)
>   at org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:199)
>   at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:127)
>   at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:609)
>   at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:329)
>   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:652)
>   at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748) {code}
> full logs: https://gist.github.com/yihua/e00a1caddacbdc570b5b757049750f39
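
The stacktrace above fails inside schema resolution because the base-file reader is selected purely by file extension, and the metadata table's base files use the `.hfile` extension. A minimal sketch of that failure mode (hypothetical class and method names, not the actual Hudi `TableSchemaResolver` code): a dispatch that only recognizes `.parquet` throws `IllegalArgumentException("Unknown file format ...")` for any `.hfile` path.

```java
import java.util.Locale;

// Hedged illustration only: shows why extension-based reader dispatch that
// lacks an ".hfile" branch rejects metadata-table base files.
public class FormatDispatchSketch {
    public static String readerFor(String path) {
        String p = path.toLowerCase(Locale.ROOT);
        if (p.endsWith(".parquet")) {
            return "parquet";
        }
        if (p.endsWith(".hfile")) {
            // The branch whose absence would produce the exception above.
            return "hfile";
        }
        throw new IllegalArgumentException("Unknown file format :" + path);
    }

    public static void main(String[] args) {
        // A metadata-table base file name resolves once the hfile branch exists.
        System.out.println(readerFor("files-_0-93-815_20220208164926830001.hfile"));
        System.out.println(readerFor("base.parquet"));
    }
}
```

With the `.hfile` branch removed, the first call would instead raise the same `Unknown file format` error seen in the logs.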



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot removed a comment on pull request #3808: [HUDI-2560] introduce id_based schema to support full schema evolution.

2022-02-14 Thread GitBox


hudi-bot removed a comment on pull request #3808:
URL: https://github.com/apache/hudi/pull/3808#issuecomment-1039829492


   
   ## CI report:
   
   * 58a7ff6fa38b21edeab6b2b221cd133f2eba9c28 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6008)
   * c585862a92bd8c1c184cfa2dd90fc7fc23830886 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6016)
   * ad1630ead2e2530fe92c54883bc55f7852d9af10 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #3808: [HUDI-2560] introduce id_based schema to support full schema evolution.

2022-02-14 Thread GitBox


hudi-bot commented on pull request #3808:
URL: https://github.com/apache/hudi/pull/3808#issuecomment-1039831567


   
   ## CI report:
   
   * c585862a92bd8c1c184cfa2dd90fc7fc23830886 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6016)
   * ad1630ead2e2530fe92c54883bc55f7852d9af10 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6018)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   






[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4667: [HUDI-3276] Rebased Parquet-based `FileInputFormat` impls to inherit from `MapredParquetInputFormat`

2022-02-14 Thread GitBox


alexeykudinkin commented on a change in pull request #4667:
URL: https://github.com/apache/hudi/pull/4667#discussion_r806428520



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java
##
@@ -161,6 +173,15 @@ protected FileSplit makeSplit(Path file, long start, long length,
 return returns.toArray(new FileStatus[0]);
   }
 
+  @Override
+  public RecordReader<NullWritable, ArrayWritable> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException {
+throw new UnsupportedEncodingException("not implemented");

Review comment:
   Good catch! Just a typo, will address in follow-up
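
   For context, the typo acknowledged above is that the method throws `UnsupportedEncodingException` (a checked `java.io.IOException` subclass meant for charset errors) where an intentionally unimplemented method would conventionally throw the unchecked `UnsupportedOperationException`. A minimal sketch of the intended shape (hypothetical `RecordReaderStub` class, not the actual follow-up patch):

```java
// Hedged sketch: an unimplemented method should signal a programming error
// with the unchecked UnsupportedOperationException, not abuse the checked
// UnsupportedEncodingException.
public class RecordReaderStub {
    public Object getRecordReader(Object split, Object job, Object reporter) {
        throw new UnsupportedOperationException("not implemented");
    }

    public static void main(String[] args) {
        try {
            new RecordReaderStub().getRecordReader(null, null, null);
        } catch (UnsupportedOperationException e) {
            // Callers that reach this path see a clear "not implemented" signal.
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```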







