[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-23 Thread GitBox


hudi-bot commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1225193356

   
   ## CI report:
   
   * 06f352b0235cbbac215174c2755fca24009799c5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10912)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] nsivabalan commented on issue #4622: [SUPPORT] Can't query Redshift rows even after downgrade from 0.10

2022-08-23 Thread GitBox


nsivabalan commented on issue #4622:
URL: https://github.com/apache/hudi/issues/4622#issuecomment-1225182724

   Thanks @nochimow for the update. Appreciate it. 





[GitHub] [hudi] fengjian428 commented on issue #6441: Status on PR: 2666: Support update partial fields for CoW table

2022-08-23 Thread GitBox


fengjian428 commented on issue #6441:
URL: https://github.com/apache/hudi/issues/6441#issuecomment-1225175998

   > What I understand -> OverwriteNonDefaultsWithLatestAvroPayload can update the non-null fields in the new data (cdc) to the old data (Hudi table). But what if I have multiple changes for the same record key in the new cdc data? Then it won't give me the correct output.
   > 
   > For example: Hudi Table: RK1, F1, F2, F3, F4, F5
   > 
   > New cdc data:
   > RK1, null, null, F3', null, F5'
   > RK1, F1', null, F3", null, null
   > RK1, null, F2', null, F4', F5"
   > 
   > So the expected output of the record key (RK1) row in the Hudi table would be: RK1, F1', F2', F3", F4', F5"
   > 
   > Is there any future plan to merge the following work into Hudi master, which can help us get partial updates? #2666
   
   Try turning off `hoodie.combine.before.upsert`?
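
   For illustration, a minimal sketch of a Spark upsert with pre-combine disabled so that every cdc row for a key reaches the payload merge. The table name, record-key field, and path are placeholders, not from this thread:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class PartialUpdateSketch {
     // Hypothetical upsert: with combine-before-upsert off, each incoming cdc row
     // is merged against the table via the payload class instead of being
     // pre-combined down to one row per key first.
     static void upsert(Dataset<Row> cdc, String basePath) {
       cdc.write().format("hudi")
           .option("hoodie.table.name", "hudi_table")               // placeholder
           .option("hoodie.datasource.write.recordkey.field", "RK") // placeholder
           .option("hoodie.datasource.write.payload.class",
               "org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload")
           .option("hoodie.combine.before.upsert", "false")
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```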





[GitHub] [hudi] brskiran1 commented on issue #6304: Hudi MultiTable Deltastreamer not updating glue catalog when new column added on Source

2022-08-23 Thread GitBox


brskiran1 commented on issue #6304:
URL: https://github.com/apache/hudi/issues/6304#issuecomment-1225173640

   @rmahindra123 responding on behalf of @SubashRanganathan. We have tried 
this without the flag hoodie.schema.on.read.enable set to true, and still don't see 
the glue catalog updated with the new column. 





[jira] [Commented] (HUDI-4698) Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core

2022-08-23 Thread Danny Chen (Jira)


[ https://issues.apache.org/jira/browse/HUDI-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583983#comment-17583983 ]

Danny Chen commented on HUDI-4698:
--

Fixed via master branch: 822c1397e04936b89fda771bb1c269de5fb0dd4b

> Rename the package 'org.apache.flink.table.data' to avoid conflicts with 
> flink table core
> -
>
> Key: HUDI-4698
> URL: https://issues.apache.org/jira/browse/HUDI-4698
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>






[jira] [Resolved] (HUDI-4698) Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core

2022-08-23 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen resolved HUDI-4698.
--

> Rename the package 'org.apache.flink.table.data' to avoid conflicts with 
> flink table core
> -
>
> Key: HUDI-4698
> URL: https://issues.apache.org/jira/browse/HUDI-4698
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: flink
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>






[hudi] branch master updated (16a80e6d41 -> 822c1397e0)

2022-08-23 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 16a80e6d41 [HUDI-4637] Release thread in RateLimiter doesn't been terminated (#6433)
 add 822c1397e0 [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core (#6481)

No new revisions were added by this update.

Summary of changes:
 .../{flink => hudi}/table/data/ColumnarArrayData.java| 16 
 .../{flink => hudi}/table/data/ColumnarMapData.java  |  4 +++-
 .../{flink => hudi}/table/data/ColumnarRowData.java  | 12 ++--
 .../table/data/vector/MapColumnVector.java   |  3 ++-
 .../table/data/vector/RowColumnVector.java   |  6 --
 .../table/data/vector/VectorizedColumnBatch.java | 14 +-
 .../hudi/table/format/cow/ParquetSplitReaderUtil.java|  2 +-
 .../hudi/table/format/cow/vector/HeapArrayVector.java|  3 ++-
 .../table/format/cow/vector/HeapMapColumnVector.java |  5 +++--
 .../table/format/cow/vector/HeapRowColumnVector.java |  7 ---
 .../format/cow/vector/reader/ArrayColumnReader.java  |  2 +-
 .../cow/vector/reader/ParquetColumnarRowSplitReader.java |  4 ++--
 12 files changed, 57 insertions(+), 21 deletions(-)
 rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/ColumnarArrayData.java (93%)
 rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/ColumnarMapData.java (94%)
 rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/ColumnarRowData.java (93%)
 rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/vector/MapColumnVector.java (90%)
 rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/vector/RowColumnVector.java (85%)
 rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/vector/VectorizedColumnBatch.java (84%)



[GitHub] [hudi] namuny commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering

2022-08-23 Thread GitBox


namuny commented on issue #6212:
URL: https://github.com/apache/hudi/issues/6212#issuecomment-1225170438

   Gentle bump to see if anyone has any further recommendations on what 
information we could provide to help with reproducing the issue.





[GitHub] [hudi] danny0405 merged pull request #6481: [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid…

2022-08-23 Thread GitBox


danny0405 merged PR #6481:
URL: https://github.com/apache/hudi/pull/6481





[GitHub] [hudi] danny0405 commented on pull request #6481: [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid…

2022-08-23 Thread GitBox


danny0405 commented on PR #6481:
URL: https://github.com/apache/hudi/pull/6481#issuecomment-1225170342

   The failed test case is flaky and should not be caused by this patch; I'll 
merge this PR and fix it in another PR.





[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-23 Thread GitBox


hudi-bot commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1225160370

   
   ## CI report:
   
   * b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10892)
 
   * 06f352b0235cbbac215174c2755fca24009799c5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10912)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-23 Thread GitBox


hudi-bot commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1225157843

   
   ## CI report:
   
   * b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10892)
 
   * 06f352b0235cbbac215174c2755fca24009799c5 UNKNOWN
   
   





[GitHub] [hudi] TengHuo commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-23 Thread GitBox


TengHuo commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1225157773

   Done. I updated the method `parseDateFromInstantTimeSafely`; it now logs a 
warning message and returns `Option.empty` when it gets an invalid timestamp, so it 
won't output metrics when the timestamp is invalid.
   
   I also rebased the code onto the latest master.
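
   For reference, a rough sketch of the described Option-returning variant inside `HoodieActiveTimeline` (an assumption about the shape; the merged code may differ):
   
   ```java
   import java.text.ParseException;
   import java.util.Date;
   import org.apache.hudi.common.util.Option;
   
   public static Option<Date> parseDateFromInstantTimeSafely(String timestamp) {
     try {
       return Option.of(HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp));
     } catch (ParseException e) {
       // Log and return empty so callers can skip emitting metrics for invalid timestamps.
       LOG.warn("Failed to parse timestamp " + timestamp + ": " + e.getMessage());
       return Option.empty();
     }
   }
   ```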





[GitHub] [hudi] Zouxxyy commented on issue #6479: [SUPPORT] How to query the previous SNAPSHOT in Hive

2022-08-23 Thread GitBox


Zouxxyy commented on issue #6479:
URL: https://github.com/apache/hudi/issues/6479#issuecomment-1225129309

   I guess it's still under development





[GitHub] [hudi] TengHuo commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-23 Thread GitBox


TengHuo commented on code in PR #6000:
URL: https://github.com/apache/hudi/pull/6000#discussion_r953286608


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java:
##
@@ -75,16 +75,56 @@ public class HoodieActiveTimeline extends HoodieDefaultTimeline {
       REQUESTED_REPLACE_COMMIT_EXTENSION, INFLIGHT_REPLACE_COMMIT_EXTENSION, REPLACE_COMMIT_EXTENSION,
       REQUESTED_INDEX_COMMIT_EXTENSION, INFLIGHT_INDEX_COMMIT_EXTENSION, INDEX_COMMIT_EXTENSION,
       REQUESTED_SAVE_SCHEMA_ACTION_EXTENSION, INFLIGHT_SAVE_SCHEMA_ACTION_EXTENSION, SAVE_SCHEMA_ACTION_EXTENSION));
+
+  private static final Set<String> NOT_PARSABLE_TIMESTAMPS = new HashSet<String>(3) {{
+      add(HoodieTimeline.INIT_INSTANT_TS);
+      add(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS);
+      add(HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS);
+    }};
+
   private static final Logger LOG = LogManager.getLogger(HoodieActiveTimeline.class);
   protected HoodieTableMetaClient metaClient;
 
   /**
    * Parse the timestamp of an Instant and return a {@code Date}.
+   * Throws ParseException if the timestamp is not in the valid format of
+   * {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   *
+   * @param timestamp a timestamp String which follows the pattern of
+   *          {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * @return Date of instant timestamp
    */
   public static Date parseDateFromInstantTime(String timestamp) throws ParseException {
     return HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
   }
 
+  /**
+   * The same parse method as above, but this method mutes ParseException:
+   * it returns Date(0) if the given timestamp is invalid, or a corresponding Date if the timestamp is one of
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#INIT_INSTANT_TS},
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#METADATA_BOOTSTRAP_INSTANT_TS},
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#FULL_BOOTSTRAP_INSTANT_TS}.
+   * This method is useful when parsing timestamps for metrics.
+   *
+   * @param timestamp a timestamp String which follows the pattern of
+   *          {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * @return Date of instant timestamp
+   */
+  public static Date parseDateFromInstantTimeSafely(String timestamp) {
+    Date parsedDate;
+    try {
+      parsedDate = HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse timestamp " + timestamp + " because of " + e.getMessage());
+      if (NOT_PARSABLE_TIMESTAMPS.contains(timestamp)) {
+        parsedDate = new Date(Integer.parseInt(timestamp));
+      } else {
+        parsedDate = new Date(0);

Review Comment:
   It's the old logic in `HoodieInstantTimeGenerator.parseDateFromInstantTime`: 
if it catches the error and the timestamp is all zeros, it returns `Date(0)`, so I 
kept it.
   
   ```java
     // Special handling for all zero timestamp which is not parsable by DateTimeFormatter
     if (timestamp.equals(ALL_ZERO_TIMESTAMP)) {
       return new Date(0);
     }
     throw e;
   ```
   
   But I agree with you: it returns a dirty value, which is bad for the code that 
uses this method. `parseDateFromInstantTimeSafely` should return an optional value, 
so the code that uses this method can decide how to deal with `Option.empty`.
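
   A hypothetical caller illustrating that choice (`reportCommitDateMetric` is an invented stand-in, not Hudi's API):
   
   ```java
   // Only report the metric when parsing succeeded; Option.empty is simply skipped.
   HoodieActiveTimeline.parseDateFromInstantTimeSafely(instant.getTimestamp())
       .ifPresent(date -> reportCommitDateMetric(date.getTime()));
   ```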






[GitHub] [hudi] danny0405 commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-23 Thread GitBox


danny0405 commented on code in PR #6000:
URL: https://github.com/apache/hudi/pull/6000#discussion_r953281948


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java:
##
@@ -75,16 +75,56 @@ public class HoodieActiveTimeline extends HoodieDefaultTimeline {
       REQUESTED_REPLACE_COMMIT_EXTENSION, INFLIGHT_REPLACE_COMMIT_EXTENSION, REPLACE_COMMIT_EXTENSION,
       REQUESTED_INDEX_COMMIT_EXTENSION, INFLIGHT_INDEX_COMMIT_EXTENSION, INDEX_COMMIT_EXTENSION,
       REQUESTED_SAVE_SCHEMA_ACTION_EXTENSION, INFLIGHT_SAVE_SCHEMA_ACTION_EXTENSION, SAVE_SCHEMA_ACTION_EXTENSION));
+
+  private static final Set<String> NOT_PARSABLE_TIMESTAMPS = new HashSet<String>(3) {{
+      add(HoodieTimeline.INIT_INSTANT_TS);
+      add(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS);
+      add(HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS);
+    }};
+
   private static final Logger LOG = LogManager.getLogger(HoodieActiveTimeline.class);
   protected HoodieTableMetaClient metaClient;
 
   /**
    * Parse the timestamp of an Instant and return a {@code Date}.
+   * Throws ParseException if the timestamp is not in the valid format of
+   * {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   *
+   * @param timestamp a timestamp String which follows the pattern of
+   *          {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * @return Date of instant timestamp
    */
   public static Date parseDateFromInstantTime(String timestamp) throws ParseException {
     return HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
   }
 
+  /**
+   * The same parse method as above, but this method mutes ParseException:
+   * it returns Date(0) if the given timestamp is invalid, or a corresponding Date if the timestamp is one of
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#INIT_INSTANT_TS},
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#METADATA_BOOTSTRAP_INSTANT_TS},
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#FULL_BOOTSTRAP_INSTANT_TS}.
+   * This method is useful when parsing timestamps for metrics.
+   *
+   * @param timestamp a timestamp String which follows the pattern of
+   *          {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * @return Date of instant timestamp
+   */
+  public static Date parseDateFromInstantTimeSafely(String timestamp) {
+    Date parsedDate;
+    try {
+      parsedDate = HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse timestamp " + timestamp + " because of " + e.getMessage());
+      if (NOT_PARSABLE_TIMESTAMPS.contains(timestamp)) {
+        parsedDate = new Date(Integer.parseInt(timestamp));
+      } else {
+        parsedDate = new Date(0);

Review Comment:
   What is the meaning of reporting a `Date(0)`, which I think is dirty data for 
metrics? Instead we should not report any metrics at all in this case.






[GitHub] [hudi] TengHuo commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-23 Thread GitBox


TengHuo commented on code in PR #6000:
URL: https://github.com/apache/hudi/pull/6000#discussion_r953269114


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java:
##
@@ -75,16 +75,56 @@ public class HoodieActiveTimeline extends HoodieDefaultTimeline {
       REQUESTED_REPLACE_COMMIT_EXTENSION, INFLIGHT_REPLACE_COMMIT_EXTENSION, REPLACE_COMMIT_EXTENSION,
       REQUESTED_INDEX_COMMIT_EXTENSION, INFLIGHT_INDEX_COMMIT_EXTENSION, INDEX_COMMIT_EXTENSION,
       REQUESTED_SAVE_SCHEMA_ACTION_EXTENSION, INFLIGHT_SAVE_SCHEMA_ACTION_EXTENSION, SAVE_SCHEMA_ACTION_EXTENSION));
+
+  private static final Set<String> NOT_PARSABLE_TIMESTAMPS = new HashSet<String>(3) {{
+      add(HoodieTimeline.INIT_INSTANT_TS);
+      add(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS);
+      add(HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS);
+    }};
+
   private static final Logger LOG = LogManager.getLogger(HoodieActiveTimeline.class);
   protected HoodieTableMetaClient metaClient;
 
   /**
    * Parse the timestamp of an Instant and return a {@code Date}.
+   * Throws ParseException if the timestamp is not in the valid format of
+   * {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   *
+   * @param timestamp a timestamp String which follows the pattern of
+   *          {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * @return Date of instant timestamp
    */
   public static Date parseDateFromInstantTime(String timestamp) throws ParseException {
     return HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
   }
 
+  /**
+   * The same parse method as above, but this method mutes ParseException:
+   * it returns Date(0) if the given timestamp is invalid, or a corresponding Date if the timestamp is one of
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#INIT_INSTANT_TS},
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#METADATA_BOOTSTRAP_INSTANT_TS},
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#FULL_BOOTSTRAP_INSTANT_TS}.
+   * This method is useful when parsing timestamps for metrics.
+   *
+   * @param timestamp a timestamp String which follows the pattern of
+   *          {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * @return Date of instant timestamp
+   */
+  public static Date parseDateFromInstantTimeSafely(String timestamp) {
+    Date parsedDate;
+    try {
+      parsedDate = HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse timestamp " + timestamp + " because of " + e.getMessage());

Review Comment:
   Got it, np. Let me rebase my code.
   






[jira] [Closed] (HUDI-4637) Release thread in RateLimiter is not terminated

2022-08-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4637.
-
Resolution: Fixed

> Release thread in RateLimiter is not terminated
> ---
>
> Key: HUDI-4637
> URL: https://issues.apache.org/jira/browse/HUDI-4637
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: index
>Reporter: xi chaomin
>Assignee: xi chaomin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When I use the hbase index, I find the job can't finish. I set the log level to 
> DEBUG and see endless printing: 
> {code:java}
> 22/08/17 18:26:45 DEBUG RateLimiter: Release permits: maxPremits: 100, available: 100
> 22/08/17 18:26:45 DEBUG RateLimiter: Release permits: maxPremits: 1000, available: 1000 {code}
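
A generic sketch of this class of fix (an assumption for illustration, not the actual patch in #6433): make the permit-release scheduler a daemon thread and shut it down explicitly so it cannot keep the job alive:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class RateLimiterSketch implements AutoCloseable {
  // Daemon thread: the scheduler alone can no longer pin the JVM.
  private final ScheduledExecutorService releaseScheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "rate-limiter-release");
        t.setDaemon(true);
        return t;
      });

  void start() {
    // Periodically refill permits, mirroring the "Release permits" debug log above.
    releaseScheduler.scheduleAtFixedRate(this::releasePermits, 1, 1, TimeUnit.SECONDS);
  }

  private void releasePermits() {
    // refill logic elided
  }

  @Override
  public void close() {
    releaseScheduler.shutdownNow(); // terminate the release thread when done
  }
}
{code}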





[hudi] branch master updated (ca8a57a21d -> 16a80e6d41)

2022-08-23 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from ca8a57a21d [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy (#6267)
 add 16a80e6d41 [HUDI-4637] Release thread in RateLimiter doesn't been terminated (#6433)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)



[GitHub] [hudi] nsivabalan merged pull request #6433: [HUDI-4637] Release thread in RateLimiter is not terminated

2022-08-23 Thread GitBox


nsivabalan merged PR #6433:
URL: https://github.com/apache/hudi/pull/6433





[GitHub] [hudi] linfey90 commented on a diff in pull request #6456: [HUDI-4674]Change the default value of inputFormat for the MOR table

2022-08-23 Thread GitBox


linfey90 commented on code in PR #6456:
URL: https://github.com/apache/hudi/pull/6456#discussion_r953253259


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala:
##
@@ -120,10 +119,8 @@ object CreateHoodieTableCommand {
 
 val tableType = tableConfig.getTableType.name()
 val inputFormat = tableType match {
-  case DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL =>
+  case DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL | 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL =>

Review Comment:
   In hive queries, the original table name is used instead of the suffix _rt 
_ro table name. at this point we will choose to skip the _ro table. I also 
think hive offline tasks should use optimization table, so its inputFormat 
default value should be HoodieParquetInputFormat.Also If this default value 
there are other considerations,I will compare and modify when use hive sync 
metadata.






[GitHub] [hudi] nsivabalan commented on pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool

2022-08-23 Thread GitBox


nsivabalan commented on PR #5920:
URL: https://github.com/apache/hudi/pull/5920#issuecomment-1225077158

   Hey @kk17: are there any updates on this patch? Once it's ready, let me know 
and I can take another look. 





[GitHub] [hudi] nsivabalan commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-23 Thread GitBox


nsivabalan commented on code in PR #6000:
URL: https://github.com/apache/hudi/pull/6000#discussion_r953246798


##
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java:
##
@@ -75,16 +75,56 @@ public class HoodieActiveTimeline extends HoodieDefaultTimeline {
       REQUESTED_REPLACE_COMMIT_EXTENSION, INFLIGHT_REPLACE_COMMIT_EXTENSION, REPLACE_COMMIT_EXTENSION,
       REQUESTED_INDEX_COMMIT_EXTENSION, INFLIGHT_INDEX_COMMIT_EXTENSION, INDEX_COMMIT_EXTENSION,
       REQUESTED_SAVE_SCHEMA_ACTION_EXTENSION, INFLIGHT_SAVE_SCHEMA_ACTION_EXTENSION, SAVE_SCHEMA_ACTION_EXTENSION));
+
+  private static final Set<String> NOT_PARSABLE_TIMESTAMPS = new HashSet<String>(3) {{
+      add(HoodieTimeline.INIT_INSTANT_TS);
+      add(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS);
+      add(HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS);
+    }};
+
   private static final Logger LOG = LogManager.getLogger(HoodieActiveTimeline.class);
   protected HoodieTableMetaClient metaClient;
 
   /**
    * Parse the timestamp of an Instant and return a {@code Date}.
+   * Throws ParseException if the timestamp is not in the valid format of
+   * {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   *
+   * @param timestamp a timestamp String which follows the pattern of
+   *          {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * @return Date of instant timestamp
    */
   public static Date parseDateFromInstantTime(String timestamp) throws ParseException {
     return HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
   }
 
+  /**
+   * The same parse method as above, but this method mutes ParseException:
+   * it returns Date(0) if the given timestamp is invalid, or a corresponding Date if the timestamp is one of
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#INIT_INSTANT_TS},
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#METADATA_BOOTSTRAP_INSTANT_TS},
+   * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#FULL_BOOTSTRAP_INSTANT_TS}.
+   * This method is useful when parsing timestamps for metrics.
+   *
+   * @param timestamp a timestamp String which follows the pattern of
+   *          {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
+   * @return Date of instant timestamp
+   */
+  public static Date parseDateFromInstantTimeSafely(String timestamp) {
+    Date parsedDate;
+    try {
+      parsedDate = HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
+    } catch (ParseException e) {
+      LOG.warn("Failed to parse timestamp " + timestamp + " because of " + e.getMessage());

Review Comment:
   Can we move this warn msg into the else block?






[jira] [Closed] (HUDI-4515) savepoints will be clean in keeping latest versions policy

2022-08-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4515.
-
Resolution: Fixed

> savepoints will be clean in keeping latest versions policy
> --
>
> Key: HUDI-4515
> URL: https://issues.apache.org/jira/browse/HUDI-4515
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cleaning
>Affects Versions: 0.11.1
>Reporter: zouxxyy
>Assignee: zouxxyy
>Priority: Blocker
>  Labels: bug, clean, pull-request-available, savepoints
> Fix For: 0.12.1
>
>
> When I tested the behavior of clean and savepoint, I found that when clean is 
> keeping latest versions, the savepointed files will be deleted. By reading 
> the code, I found that this should be a bug.
>  
> For example, if I use "HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS", and 
> set “hoodie.cleaner.fileversions.retained” to 2, I do the following:
> 1. insert, get _001.parquet
> 2. savepoint
> 3. insert, get _002.parquet
> 4. insert, get _003.parquet
> After the fourth step, _001.parquet will be deleted even though it 
> belongs to a savepoint!
>  
> Here is 
> hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
>  getFilesToCleanKeepingLatestVersions
>  * According to the following code, on the one hand, the savepointed files 
> belonging to the kept versions will be skipped and will not be counted in the 
> calculation of keepVersions, which I feel is unreasonable.
>  * On the other hand, if there is a savepointed file among the remaining 
> versions, it will be deleted, which I don't think is in line with the design 
> philosophy of savepoints.
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }{code}
>  
> So I think the savepoint check should be moved down; it can be fixed like this:
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   fileSliceIterator.next();
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }{code}
>  
> Thanks.





[GitHub] [hudi] nsivabalan merged pull request #6267: [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy

2022-08-23 Thread GitBox


nsivabalan merged PR #6267:
URL: https://github.com/apache/hudi/pull/6267





[hudi] branch master updated (1879efa45d -> ca8a57a21d)

2022-08-23 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from 1879efa45d [HUDI-4686] Flip option 'write.ignore.failed' to default false (#6467)
 add ca8a57a21d [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy (#6267)

No new revisions were added by this update.

Summary of changes:
 .../hudi/table/action/clean/CleanPlanner.java  | 10 +--
 .../org/apache/hudi/client/TestClientRollback.java | 98 ++
 2 files changed, 103 insertions(+), 5 deletions(-)



[GitHub] [hudi] nsivabalan commented on pull request #6157: [HUDI-4431] Fix log file will not roll over to a new file

2022-08-23 Thread GitBox


nsivabalan commented on PR #6157:
URL: https://github.com/apache/hudi/pull/6157#issuecomment-1225068635

   @XuQianJin-Stars: hey, can you follow up on this? Do we need a fix, or is it 
already taken care of? Let us know and we can close it out.





[GitHub] [hudi] bhasudha commented on pull request #6482: [DOCS] Add youtube channel and Office hours page

2022-08-23 Thread GitBox


bhasudha commented on PR #6482:
URL: https://github.com/apache/hudi/pull/6482#issuecomment-1225060413

   **Image of the header**
   https://user-images.githubusercontent.com/2179254/186296081-1401a649-663e-4db0-9c67-5aef18ff6042.png
   The logo is updated but is not usually visible in local website deployment. That's why you see an icon. 
   
   
   **Image of footer**
   https://user-images.githubusercontent.com/2179254/186296096-288d4329-2c22-4966-9616-5df52bbc8265.png
   You can verify the link as well
   
   **Image of weekly office hours page**
   https://user-images.githubusercontent.com/2179254/186296219-c02372b9-b72b-4985-b51a-89feba287f1b.png
   
   
   **Image of office hours in drop down**
   https://user-images.githubusercontent.com/2179254/186296640-85aee702-ed65-42d9-b1a3-31365789ee01.png
   





[GitHub] [hudi] bhasudha opened a new pull request, #6482: [DOCS] Add youtube channel and Office hours page

2022-08-23 Thread GitBox


bhasudha opened a new pull request, #6482:
URL: https://github.com/apache/hudi/pull/6482

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource

2022-08-23 Thread GitBox


hudi-bot commented on PR #6135:
URL: https://github.com/apache/hudi/pull/6135#issuecomment-1225016353

   
   ## CI report:
   
   * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN
   * f70abbc3b45005d40e74252814edc0078a50030e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10909)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile

2022-08-23 Thread GitBox


hudi-bot commented on PR #6105:
URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224985799

   
   ## CI report:
   
   * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN
   * 326f8f69ea423a58df8c98f382528efb9424d053 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10908)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource

2022-08-23 Thread GitBox


hudi-bot commented on PR #6135:
URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224982778

   
   ## CI report:
   
   * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN
   * 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10905)
 
   * f70abbc3b45005d40e74252814edc0078a50030e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10909)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile

2022-08-23 Thread GitBox


hudi-bot commented on PR #6105:
URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224982712

   
   ## CI report:
   
   * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN
   * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10907)
 
   * 326f8f69ea423a58df8c98f382528efb9424d053 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10908)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource

2022-08-23 Thread GitBox


hudi-bot commented on PR #6135:
URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224979442

   
   ## CI report:
   
   * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN
   * 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10905)
 
   * f70abbc3b45005d40e74252814edc0078a50030e UNKNOWN
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile

2022-08-23 Thread GitBox


hudi-bot commented on PR #6105:
URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224979379

   
   ## CI report:
   
   * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN
   * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10907)
 
   * 326f8f69ea423a58df8c98f382528efb9424d053 UNKNOWN
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6432: [HUDI-4586] Improve metadata fetching in bloom index

2022-08-23 Thread GitBox


hudi-bot commented on PR #6432:
URL: https://github.com/apache/hudi/pull/6432#issuecomment-1224976328

   
   ## CI report:
   
   * ed15f57dc58b2e9142dd33a0ecd078bf4c236afc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10887)
 
   
   





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6352: [HUDI-4584] Fixing `SQLConf` not being propagated to executor

2022-08-23 Thread GitBox


alexeykudinkin commented on code in PR #6352:
URL: https://github.com/apache/hudi/pull/6352#discussion_r953174810


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/SQLConfInjectingRDD.scala:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.{Partition, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.internal.SQLConf
+
+import scala.reflect.ClassTag
+
+/**
+ * NOTE: This is a generalized version of Spark's [[SQLExecutionRDD]]
+ *
+ * It is just a wrapper over [[sqlRDD]] which sets and makes effective all the configs from the
+ * captured [[SQLConf]]
+ *
+ * @param sqlRDD the `RDD` generated by the SQL plan
+ * @param conf the `SQLConf` to apply to the execution of the SQL plan
+ */
+class SQLConfInjectingRDD[T: ClassTag](var sqlRDD: RDD[T], @transient conf: SQLConf) extends RDD[T](sqlRDD) {
+  private val sqlConfigs = conf.getAllConfs
+  private lazy val sqlConfExecutorSide = {
+    val newConf = new SQLConf()
+    sqlConfigs.foreach { case (k, v) => newConf.setConfString(k, v) }
+    newConf
+  }
+
+  override val partitioner = firstParent[InternalRow].partitioner
+
+  override def getPartitions: Array[Partition] = firstParent[InternalRow].partitions
+
+  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
+    // If we are in the context of a tracked SQL operation, `SQLExecution.EXECUTION_ID_KEY` is set
+    // and we have nothing to do here. Otherwise, we use the `SQLConf` captured at the creation of
+    // this RDD.
+    if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
+      SQLConf.withExistingConf(sqlConfExecutorSide) {

Review Comment:
   Yes, it will propagate to all RDDs in the execution chain (up to a shuffling 
point)






[GitHub] [hudi] yihua commented on a diff in pull request #6352: [HUDI-4584] Fixing `SQLConf` not being propagated to executor

2022-08-23 Thread GitBox


yihua commented on code in PR #6352:
URL: https://github.com/apache/hudi/pull/6352#discussion_r953160525


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/SQLConfInjectingRDD.scala:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.{Partition, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.internal.SQLConf
+
+import scala.reflect.ClassTag
+
+/**
+ * NOTE: This is a generalized version of Spark's [[SQLExecutionRDD]]
+ *
+ * It is just a wrapper over [[sqlRDD]] which sets and makes effective all the configs from the
+ * captured [[SQLConf]]
+ *
+ * @param sqlRDD the `RDD` generated by the SQL plan
+ * @param conf the `SQLConf` to apply to the execution of the SQL plan
+ */
+class SQLConfInjectingRDD[T: ClassTag](var sqlRDD: RDD[T], @transient conf: SQLConf) extends RDD[T](sqlRDD) {
+  private val sqlConfigs = conf.getAllConfs
+  private lazy val sqlConfExecutorSide = {
+    val newConf = new SQLConf()
+    sqlConfigs.foreach { case (k, v) => newConf.setConfString(k, v) }
+    newConf
+  }
+
+  override val partitioner = firstParent[InternalRow].partitioner
+
+  override def getPartitions: Array[Partition] = firstParent[InternalRow].partitions
+
+  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
+    // If we are in the context of a tracked SQL operation, `SQLExecution.EXECUTION_ID_KEY` is set
+    // and we have nothing to do here. Otherwise, we use the `SQLConf` captured at the creation of
+    // this RDD.
+    if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
+      SQLConf.withExistingConf(sqlConfExecutorSide) {

Review Comment:
   @alexeykudinkin I was asking the latter.






[GitHub] [hudi] hudi-bot commented on pull request #6432: [HUDI-4586] Improve metadata fetching in bloom index

2022-08-23 Thread GitBox


hudi-bot commented on PR #6432:
URL: https://github.com/apache/hudi/pull/6432#issuecomment-1224932396

   
   ## CI report:
   
   * ed15f57dc58b2e9142dd33a0ecd078bf4c236afc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10887)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile

2022-08-23 Thread GitBox


hudi-bot commented on PR #6105:
URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224931869

   
   ## CI report:
   
   * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN
   * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10907)
 
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6432: [HUDI-4586] Improve metadata fetching in bloom index

2022-08-23 Thread GitBox


hudi-bot commented on PR #6432:
URL: https://github.com/apache/hudi/pull/6432#issuecomment-1224927464

   
   ## CI report:
   
   * ed15f57dc58b2e9142dd33a0ecd078bf4c236afc UNKNOWN
   
   





[GitHub] [hudi] dyang108 commented on issue #6428: [SUPPORT] S3 Deltastreamer: Block has already been inflated

2022-08-23 Thread GitBox


dyang108 commented on issue #6428:
URL: https://github.com/apache/hudi/issues/6428#issuecomment-1224925928

   Update: I got it working on an older version of Hudi (0.10.1), so this seems 
like a regression.





[GitHub] [hudi] nikspatel03 commented on issue #6441: Status on PR: 2666: Support update partial fields for CoW table

2022-08-23 Thread GitBox


nikspatel03 commented on issue #6441:
URL: https://github.com/apache/hudi/issues/6441#issuecomment-1224887332

   What I understand -> OverwriteNonDefaultsWithLatestAvroPayload can update 
the non-null fields in the new data (cdc) to the old data (Hudi table). But what 
if I have multiple changes for the same record key in the new cdc data? Then it 
won't give me the correct output.  
   
   For example: 
   Hudi Table:
   RK1, F1, F2, F3, F4, F5
   
   New cdc data: 
   RK1, null, null, F3', null, F5'
   RK1, F1', null, F3", null, null
   RK1, null, F2', null, F4', F5"
   
   So the expected output of the record key (RK1) row in the Hudi table would be: 
   RK1, F1', F2', F3", F4', F5"
   
   Is there any future plan to merge the following work into Hudi master, which 
can help us get partial updates? 
   https://github.com/apache/hudi/pull/2666
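
   A toy sketch of the merge semantics being asked for (not Hudi's payload implementation; plain String arrays stand in for Avro records):
   
   ```java
   // Apply changes in order; for each field, a non-null incoming value wins.
   static String[] mergeNonNull(String[] current, String[] incoming) {
     String[] merged = current.clone();
     for (int i = 0; i < merged.length; i++) {
       if (incoming[i] != null) {
         merged[i] = incoming[i];
       }
     }
     return merged;
   }
   // Folding the three cdc rows for RK1 over (F1, F2, F3, F4, F5) in order
   // yields (F1', F2', F3", F4', F5"), the expected output above.
   ```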
   
   





[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile

2022-08-23 Thread GitBox


hudi-bot commented on PR #6105:
URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224873599

   
   ## CI report:
   
   * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN
   * 35c07f36c6409d471e1810833cec0b27cbf78cf9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10906)
 
   * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10907)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nochimow closed issue #4622: [SUPPORT] Can't query Redshift rows even after downgrade from 0.10

2022-08-23 Thread GitBox


nochimow closed issue #4622: [SUPPORT] Can't query Redshift rows even after 
downgrade from 0.10
URL: https://github.com/apache/hudi/issues/4622


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nochimow commented on issue #4622: [SUPPORT] Can't query Redshift rows even after downgrade from 0.10

2022-08-23 Thread GitBox


nochimow commented on issue #4622:
URL: https://github.com/apache/hudi/issues/4622#issuecomment-1224870906

   Even though AWS says that only 0.10.0 is "supported", I ran some 
compatibility tests with Hudi 0.10, 0.11 and 0.12. 
   All versions now work fine, which wasn't the case before. (Prior to that, any 
table written with Hudi version > 0.9 was returning 0 rows on Redshift Spectrum.)
   The only caveat is that the Redshift cluster must be on patch >= 169. (Got 
this requirement from AWS support.)  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6474: [SUPPORT] Hudi Deltastreamer fails to acquire lock with DynamoDB Lock Provider.

2022-08-23 Thread GitBox


nsivabalan commented on issue #6474:
URL: https://github.com/apache/hudi/issues/6474#issuecomment-1224816176

   yeah. From what I see, the cleaner waits for the lock (which was acquired to 
apply `20220822020402958` to the metadata table), but after retrying and before 
giving up, the cleaner releases the lock, which should not happen. We did make 
a fix in 0.11.1 to avoid a non-owner releasing the lock 
[here](https://github.com/apache/hudi/pull/5255), but it looks like there is 
more to be looked into. 
   
   
   ```
   
   02:06:31 : acquiring lock by 20220822020402958__deltacommit__INFLIGHT in 
MDT. 
   
   02:06:46 : clean is attempted in data table. (async cleaner)
   02:06:48: clean tries to acquire lock. 
   
   22/08/22 02:06:48 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction starting for 
Optional.empty with latest completed transaction instant Optional.empty
   22/08/22 02:06:48 INFO org.apache.hudi.client.transaction.lock.LockManager: 
LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
   
   02:07:47: after checking for compaction, new delta commit started in MDT.
   
   02:07:50: we see the deltacommit state is moved to completed. 
   
   02:08:22 : new delta commit starts. regular writer. 
   
   02:08:59: tries to acquire lock. 
   02:11:10: tries to acquire lock. 
   02:13:21: tries to acquire lock. 
   02:15:32:tries to acquire lock. 
   02:17:43:tries to acquire lock. 
   02:19:53:tries to acquire lock. 
   02:22:04:tries to acquire lock. 
   02:24:15:tries to acquire lock. 
   02:26:25:tries to acquire lock. 
   02:28:36:tries to acquire lock. 
   
   02:30:47: INFO org.apache.hudi.client.transaction.TransactionManager: 
Transaction ending with transaction owner Optional.empty
   22/08/22 02:30:47 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:30:47 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ended with 
transaction owner Optional.empty
   
   25 mins so far from the time clean tried to acquire lock.
   
   clean fails since it could not acquire the lock. 
   
   22/08/22 02:31:00 : original owner who acquired the lock is releasing it 
now. 
   
   22/08/22 02:31:00 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ending with 
transaction owner Option{val=[==>20220822020402958__deltacommit__INFLIGHT]}
   22/08/22 02:31:00 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASING lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:31:00 INFO 
org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASED lock 
at DynamoDb table = HudiLocker, partition key = process
   22/08/22 02:31:00 INFO 
org.apache.hudi.client.transaction.TransactionManager: Transaction ended with 
transaction owner Option{val=[==>20220822020402958__deltacommit__INFLIGHT]}
   ```
   
   I might need to spend some more time to put in a fix for this. 
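   
   The guard we want is roughly the following (an illustrative sketch, not the 
actual TransactionManager/LockManager code; the class and method names here are 
made up):
   
   ```java
   // Sketch: a lock wrapper that refuses to release unless the caller is the
   // transaction that acquired it. Illustrative only; not Hudi's real API.
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.locks.ReentrantLock;
   
   class OwnerCheckedLock {
     private final ReentrantLock lock = new ReentrantLock();
     private volatile String ownerTxn; // instant time of the owning transaction
   
     boolean tryBegin(String txn, long timeoutMs) throws InterruptedException {
       if (lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
         ownerTxn = txn;
         return true;
       }
       return false; // acquisition failed: the caller must NOT release
     }
   
     void end(String txn) {
       // only the transaction that acquired the lock may release it
       if (txn != null && txn.equals(ownerTxn)) {
         ownerTxn = null;
         lock.unlock();
       }
     }
   }
   ```
   
   Under such a check, the cleaner's transaction ending with owner 
`Optional.empty` at 02:30:47 would be a no-op, and only the 
`20220822020402958__deltacommit__INFLIGHT` owner at 02:31:00 would actually 
release the DynamoDB lock.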
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile

2022-08-23 Thread GitBox


hudi-bot commented on PR #6105:
URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224756201

   
   ## CI report:
   
   * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN
   * 35c07f36c6409d471e1810833cec0b27cbf78cf9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10906)
 
   * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark3 the default profile

2022-08-23 Thread GitBox


hudi-bot commented on PR #6105:
URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224734575

   
   ## CI report:
   
   * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN
   * 58aadea50328122e1a9a1b01d38e3af12e33fbe1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9947)
 
   * 35c07f36c6409d471e1810833cec0b27cbf78cf9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10906)
 
   * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark3 the default profile

2022-08-23 Thread GitBox


hudi-bot commented on PR #6105:
URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224723686

   
   ## CI report:
   
   * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN
   * 58aadea50328122e1a9a1b01d38e3af12e33fbe1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9947)
 
   * 35c07f36c6409d471e1810833cec0b27cbf78cf9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] minihippo commented on pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool

2022-08-23 Thread GitBox


minihippo commented on PR #5920:
URL: https://github.com/apache/hudi/pull/5920#issuecomment-1224555479

   > > can we please write a test for the changes made.
   > 
   > any instruction on how to write a test?
   
   Hi @kk17, you can refer to the unit tests in `TestHiveSyncTool`: mock a table 
written by 0.8, then call the new sync function and verify that Hive sync in 
0.11 succeeds.
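   
   Something along these lines (a rough outline only; the helper methods below 
are hypothetical placeholders to be wired up to the actual `TestHiveSyncTool` 
fixtures):
   
   ```java
   // Rough outline of the suggested test. Helper names are hypothetical;
   // replace them with the existing TestHiveSyncTool utilities.
   import org.junit.jupiter.api.Test;
   import static org.junit.jupiter.api.Assertions.assertTrue;
   
   public class TestHiveSyncSerDeUpdate {
   
     @Test
     public void updatesSerDeInfoForTableWrittenBy08() throws Exception {
       String tableName = "legacy_table";
       createTableAsWrittenBy08(tableName);        // mock a table performed in 0.8
       runHiveSync(tableName);                     // exercise the new sync path
       assertTrue(serDeInfoIsUpToDate(tableName)); // 0.11 hive sync succeeds
     }
   
     private void createTableAsWrittenBy08(String name) {
       throw new UnsupportedOperationException("use TestHiveSyncTool fixtures");
     }
   
     private void runHiveSync(String name) {
       throw new UnsupportedOperationException("invoke HiveSyncTool here");
     }
   
     private boolean serDeInfoIsUpToDate(String name) {
       throw new UnsupportedOperationException("read SerDe info from the metastore");
     }
   }
   ```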


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource

2022-08-23 Thread GitBox


hudi-bot commented on PR #6135:
URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224544985

   
   ## CI report:
   
   * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN
   * 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10905)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-2369) Blog on bulk insert sort modes

2022-08-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2369:
--
Sprint: 2022/09/05

> Blog on bulk insert sort modes
> --
>
> Key: HUDI-2369
> URL: https://issues.apache.org/jira/browse/HUDI-2369
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Blog on bulk insert sort modes



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2369) Blog on bulk insert sort modes

2022-08-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2369:
--
Fix Version/s: 0.12.1

> Blog on bulk insert sort modes
> --
>
> Key: HUDI-2369
> URL: https://issues.apache.org/jira/browse/HUDI-2369
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Blog on bulk insert sort modes



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua commented on pull request #6442: [HUDI-4449] Support DataSourceV2 Read for Spark3.2

2022-08-23 Thread GitBox


yihua commented on PR #6442:
URL: https://github.com/apache/hudi/pull/6442#issuecomment-1224494779

   @alexeykudinkin FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1

2022-08-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4496:
--
Fix Version/s: 0.13.0

> ORC fails w/ Spark 3.1
> --
>
> Key: HUDI-4496
> URL: https://issues.apache.org/jira/browse/HUDI-4496
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> After running the TestHoodieSparkSqlWriter test for different Spark versions, 
> we discovered that the Orc version was incorrectly put as a compile-time dep 
> on the classpath, breaking Orc writing in Hudi on Spark 3.1:
> https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4389) Make HoodieStreamingSink idempotent

2022-08-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4389:
--
Sprint: 2022/08/22  (was: 2022/09/19)

> Make HoodieStreamingSink idempotent
> ---
>
> Key: HUDI-4389
> URL: https://issues.apache.org/jira/browse/HUDI-4389
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available, streaming
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2673) Add integration/e2e test for kafka-connect functionality

2022-08-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2673:
--
Sprint: Hudi-Sprint-Apr-19, Hudi-Sprint-Apr-25, 2022/05/02, 2022/05/16, 
2022/08/22  (was: Hudi-Sprint-Apr-19, Hudi-Sprint-Apr-25, 2022/05/02, 
2022/05/16)

> Add integration/e2e test for kafka-connect functionality
> 
>
> Key: HUDI-2673
> URL: https://issues.apache.org/jira/browse/HUDI-2673
> Project: Apache Hudi
>  Issue Type: Task
>  Components: kafka-connect, tests-ci
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> The integration test should use the bundle jar and run in the docker setup. 
> This can prevent issues in the bundle, like HUDI-3903, that are not covered 
> by unit and functional tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

2022-08-23 Thread GitBox


yihua commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1224476707

   @alexeykudinkin could you also review this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4212) kafka-connect module: Unresolved dependency: 'jdk.tools:jdk.tools:jar:1.7'

2022-08-23 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4212:
--
Sprint: 2022/08/08, 2022/09/05  (was: 2022/08/08, 2022/08/22)

> kafka-connect module: Unresolved dependency: 'jdk.tools:jdk.tools:jar:1.7'
> --
>
> Key: HUDI-4212
> URL: https://issues.apache.org/jira/browse/HUDI-4212
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies, dev-experience, kafka-connect
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Project import first time and IDE complains about Unresolved dependency: 
> 'jdk.tools:jdk.tools:jar:1.7' for kafka-connect module.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] rmahindra123 commented on issue #6348: [SUPPORT] Hudi error while running HoodieMultiTableDeltaStreamer: Commit 20220809112130103 failed and rolled-back !

2022-08-23 Thread GitBox


rmahindra123 commented on issue #6348:
URL: https://github.com/apache/hudi/issues/6348#issuecomment-1224406489

   The multi-table Deltastreamer runs the ingestion sequentially, so it will 
first ingest table1 and then table2. Let me know if you are still facing 
issues. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource

2022-08-23 Thread GitBox


hudi-bot commented on PR #6135:
URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224389948

   
   ## CI report:
   
   * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN
   * 14115a6f79de39f538ddfba407f84249c35ebca5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10881)
 
   * 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10905)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6456: [HUDI-4674]Change the default value of inputFormat for the MOR table

2022-08-23 Thread GitBox


alexeykudinkin commented on code in PR #6456:
URL: https://github.com/apache/hudi/pull/6456#discussion_r952916053


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala:
##
@@ -120,10 +119,8 @@ object CreateHoodieTableCommand {
 
 val tableType = tableConfig.getTableType.name()
 val inputFormat = tableType match {
-  case DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL =>
+  case DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL | 
DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL =>

Review Comment:
   @linfey90 I don't think this change makes sense. 
   
   Can you please elaborate on what you're trying to achieve here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource

2022-08-23 Thread GitBox


hudi-bot commented on PR #6135:
URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224382862

   
   ## CI report:
   
   * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN
   * 14115a6f79de39f538ddfba407f84249c35ebca5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10881)
 
   * 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin closed pull request #6193: [WIP] Fixing logging dependencies and configs

2022-08-23 Thread GitBox


alexeykudinkin closed pull request #6193: [WIP] Fixing logging dependencies and 
configs
URL: https://github.com/apache/hudi/pull/6193


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on pull request #6193: [WIP] Fixing logging dependencies and configs

2022-08-23 Thread GitBox


alexeykudinkin commented on PR #6193:
URL: https://github.com/apache/hudi/pull/6193#issuecomment-1224359433

   Yeah, this could be closed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on pull request #6193: [WIP] Fixing logging dependencies and configs

2022-08-23 Thread GitBox


yihua commented on PR #6193:
URL: https://github.com/apache/hudi/pull/6193#issuecomment-1224348069

   @alexeykudinkin Is this still needed or replaced by #6170?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4586) Address S3 timeouts in Bloom Index with metadata table

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4586:
-
Story Points: 1  (was: 5)

> Address S3 timeouts in Bloom Index with metadata table
> --
>
> Key: HUDI-4586
> URL: https://issues.apache.org/jira/browse/HUDI-4586
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2022-08-15 at 17.39.01.png
>
>
> For a partitioned table, a significant number of S3 requests time out, 
> causing the upserts to fail when using the Bloom Index with the metadata table.
> {code:java}
> Load meta index key ranges for file slices: hudi
> collect at HoodieSparkEngineContext.java:137
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
> org.apache.hudi.index.bloom.HoodieBloomIndex.loadColumnRangesFromMetaIndex(HoodieBloomIndex.java:213)
> org.apache.hudi.index.bloom.HoodieBloomIndex.getBloomIndexFileInfoForPartitions(HoodieBloomIndex.java:145)
> org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:123)
> org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:89)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:49)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:32)
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:329)
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>  {code}
> {code:java}
> org.apache.hudi.exception.HoodieException: Exception when reading log file 
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:352)
>     at 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:196)
>     at 
> org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.getRecordsByKeys(HoodieMetadataMergedLogRecordReader.java:124)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:266)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:222)
>     at java.util.HashMap.forEach(HashMap.java:1290)
>     at 
> org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:209)
>     at 
> org.apache.hudi.metadata.BaseTableMetadata.getColumnStats(BaseTableMetadata.java:253)
>     at 
> org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadColumnRangesFromMetaIndex$cc8e7ca2$1(HoodieBloomIndex.java:224)
>     at 
> org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at 
> org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at 

[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4635:
-
Story Points: 0.5  (was: 1)

> Update roadmap page based on H2 2022 plan
> -
>
> Key: HUDI-4635
> URL: https://issues.apache.org/jira/browse/HUDI-4635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3636:
-
Story Points: 2  (was: 4)

> Clustering fails due to marker creation failure
> ---
>
> Key: HUDI-3636
> URL: https://issues.apache.org/jira/browse/HUDI-3636
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: multi-writer
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Scenario: multi-writer test, one writer doing ingestion with Deltastreamer in 
> continuous mode, COW, inserts, async clustering and cleaning (partitions 
> under 2022/1, 2022/2), another writer with Spark datasource doing backfills 
> to different partitions (2021/12).  
> 0.10.0 no MT, clustering instant is inflight (failing it in the middle before 
> upgrade) ➝ 0.11 MT, with multi-writer configuration the same as before.
> The clustering/replace instant cannot make progress due to marker creation 
> failure, failing the DS ingestion as well.  Need to investigate if this is 
> timeline-server-based marker related or MT related.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0 
> (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>     at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
>     at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
>     at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
>     at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
>     at 
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file 
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] 
> failed: Connection refused (Connection refused)
>     at 
> org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:94)
>     at 
> 

[jira] [Updated] (HUDI-4585) Optimize query performance on Presto Hudi connector

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4585:
-
Story Points: 0  (was: 2)

> Optimize query performance on Presto Hudi connector
> 
>
> Key: HUDI-4585
> URL: https://issues.apache.org/jira/browse/HUDI-4585
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2955) Upgrade Hadoop to 3.3.x

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2955:
-
Sprint: Hudi-Sprint-Feb-14, Hudi-Sprint-Mar-14, Hudi-Sprint-Mar-21, 
Hudi-Sprint-Mar-22, Hudi-Sprint-Apr-05, Hudi-Sprint-Apr-19, Hudi-Sprint-Apr-25, 
2022/05/02, 2022/05/16, 2022/05/31  (was: Hudi-Sprint-Feb-14, 
Hudi-Sprint-Mar-14, Hudi-Sprint-Mar-21, Hudi-Sprint-Mar-22, Hudi-Sprint-Apr-05, 
Hudi-Sprint-Apr-19, Hudi-Sprint-Apr-25, 2022/05/02, 2022/05/16, 2022/05/31, 
2022/08/22)

> Upgrade Hadoop to 3.3.x
> ---
>
> Key: HUDI-2955
> URL: https://issues.apache.org/jira/browse/HUDI-2955
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Rahil Chertara
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2021-12-07 at 2.32.51 PM.png
>
>
> According to Hadoop compatibility matrix, this is a pre-requisite to 
> upgrading to JDK11:
> !Screen Shot 2021-12-07 at 2.32.51 PM.png|width=938,height=230!
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions]
>  
> *Upgrading Hadoop from 2.x to 3.x*
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts]
> Everything (relevant to us) seems to be in good shape, except Spark 2.2/2.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6352: [HUDI-4584] Fixing `SQLConf` not being propagated to executor

2022-08-23 Thread GitBox


alexeykudinkin commented on code in PR #6352:
URL: https://github.com/apache/hudi/pull/6352#discussion_r952852924


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/SQLConfInjectingRDD.scala:
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.{Partition, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.internal.SQLConf
+
+import scala.reflect.ClassTag
+
+/**
+ * NOTE: This is a generalized version of Spark's [[SQLExecutionRDD]]
+ *
+ * It is just a wrapper over [[sqlRDD]] which sets and makes effective all the 
configs from the
+ * captured [[SQLConf]]
+ *
+ * @param sqlRDD the `RDD` generated by the SQL plan
+ * @param conf the `SQLConf` to apply to the execution of the SQL plan
+ */
+class SQLConfInjectingRDD[T: ClassTag](var sqlRDD: RDD[T], @transient conf: 
SQLConf) extends RDD[T](sqlRDD) {
+  private val sqlConfigs = conf.getAllConfs
+  private lazy val sqlConfExecutorSide = {
+val newConf = new SQLConf()
+sqlConfigs.foreach { case (k, v) => newConf.setConfString(k, v) }
+newConf
+  }
+
+  override val partitioner = firstParent[InternalRow].partitioner
+
+  override def getPartitions: Array[Partition] = 
firstParent[InternalRow].partitions
+
+  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
+// If we are in the context of a tracked SQL operation, 
`SQLExecution.EXECUTION_ID_KEY` is set
+// and we have nothing to do here. Otherwise, we use the `SQLConf` 
captured at the creation of
+// this RDD.
+if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
+  SQLConf.withExistingConf(sqlConfExecutorSide) {

Review Comment:
   Not sure I understood your question: do you mean whether we're wrapping any 
other chained RDD, or whether the SQLConf will get propagated to every other 
chained RDD?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] pomaster commented on issue #6344: [SUPPORT] spark-sql schema_evolution

2022-08-23 Thread GitBox


pomaster commented on issue #6344:
URL: https://github.com/apache/hudi/issues/6344#issuecomment-1224294973

   @nsivabalan Looks like @KnightChess has updated the doc already.  Thanks 
@KnightChess.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4659) Develop a validation tool for bootstrap table

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4659:
-
Sprint: 2022/09/05  (was: 2022/08/22)

> Develop a validation tool for bootstrap table
> -
>
> Key: HUDI-4659
> URL: https://issues.apache.org/jira/browse/HUDI-4659
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1369:
-
Sprint: 2022/09/05  (was: 2022/08/22)

> Bootstrap datasource jobs from hanging via spark-submit
> ---
>
> Key: HUDI-1369
> URL: https://issues.apache.org/jira/browse/HUDI-1369
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> MOR table creation via Hudi datasource hangs at the end of the spark-submit 
> job.
> Looks like {{HoodieWriteClient}} at 
> [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255]
>  is not being closed, which means the timeline server is not stopped at the 
> end; as a result, the job hangs and never exits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4125) Add IT (Azure CI) around bootstrapped Hudi table

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4125:
-
Sprint: 2022/09/05  (was: 2022/08/22)

> Add IT (Azure CI) around bootstrapped Hudi table
> 
>
> Key: HUDI-4125
> URL: https://issues.apache.org/jira/browse/HUDI-4125
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.13.0
>
>
> For a bootstrapped Hudi table using the bootstrap format, the table should be 
> queryable through different engines without any issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6481: [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid…

2022-08-23 Thread GitBox


hudi-bot commented on PR #6481:
URL: https://github.com/apache/hudi/pull/6481#issuecomment-1224281869

   
   ## CI report:
   
   * 3eb012affd4283f9970445bf3dbf4cb48afc25bf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10903)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6480: [HUDI-4687] add show_invalid_parquet procedure

2022-08-23 Thread GitBox


hudi-bot commented on PR #6480:
URL: https://github.com/apache/hudi/pull/6480#issuecomment-1224281819

   
   ## CI report:
   
   * 9d161840463bb97d4872ce8a2c376cb9e0d00440 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10904)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] rmahindra123 commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-08-23 Thread GitBox


rmahindra123 commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1224268630

   Confirmed that #6352 resolves the issue after adding the following config:
   --conf spark.sql.avro.datetimeRebaseModeInWrite=LEGACY
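   
   Equivalently, if you build the session yourself (a minimal sketch of a 
session setup, not the Deltastreamer launcher itself; the app name is 
arbitrary):
   
   ```java
   // Setting the rebase mode in code instead of via --conf on spark-submit.
   import org.apache.spark.sql.SparkSession;
   
   public class SessionWithLegacyRebase {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("deltastreamer-legacy-rebase")
           // rebase pre-Gregorian dates/timestamps the way Spark 2.x wrote them
           .config("spark.sql.avro.datetimeRebaseModeInWrite", "LEGACY")
           .getOrCreate();
       spark.stop();
     }
   }
   ```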


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4585) Optimize query performance on Presto Hudi connector

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4585:
-
Story Points: 2  (was: 10)

> Optimize query performance on Presto Hudi connector
> 
>
> Key: HUDI-4585
> URL: https://issues.apache.org/jira/browse/HUDI-4585
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] rmahindra123 commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3

2022-08-23 Thread GitBox


rmahindra123 commented on issue #6278:
URL: https://github.com/apache/hudi/issues/6278#issuecomment-1224263388

   Was able to reproduce by adding the following line in my source:
   `newDataSet = newDataSet.withColumn("invalidDates", functions.lit("1000-01-11").cast(DataTypes.DateType));`
   
   Full stacktrace here: 
https://gist.github.com/rmahindra123/4ab3614ef6ce30ee2c72499f2633de57
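   
   Expanded into a self-contained form (a sketch; the input dataset is made up 
for illustration, and only the `withColumn` call mirrors the actual repro):
   
   ```java
   // Minimal repro sketch: a pre-Gregorian date triggers the rebase exception
   // on write unless datetimeRebaseModeInWrite=LEGACY is set.
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   import org.apache.spark.sql.functions;
   import org.apache.spark.sql.types.DataTypes;
   
   public class InvalidDateRepro {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("invalid-date-repro").master("local[*]").getOrCreate();
       Dataset<Row> newDataSet = spark.range(10).toDF("id");
       newDataSet = newDataSet.withColumn("invalidDates",
           functions.lit("1000-01-11").cast(DataTypes.DateType));
       newDataSet.show();
       spark.stop();
     }
   }
   ```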


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4468) Simplify TimeTravel logic for Spark 3.3

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4468:
-
Sprint: 2022/09/19  (was: 2022/08/22)

> Simplify TimeTravel logic for Spark 3.3
> ---
>
> Key: HUDI-4468
> URL: https://issues.apache.org/jira/browse/HUDI-4468
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Shawn Chang
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Existing Hudi relies on .g4 files and antlr classes to make time travel work 
> for Spark 3.2. 
> Since time travel is supported in Spark 3.3, that logic can be greatly 
> simplified and some of it can be removed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4467) Port borrowed code from Spark 3.3

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4467:
-
Sprint: 2022/09/19  (was: 2022/08/22)

> Port borrowed code from Spark 3.3
> -
>
> Key: HUDI-4467
> URL: https://issues.apache.org/jira/browse/HUDI-4467
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Shawn Chang
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently some classes are copied from the Spark 3.2 module, as-is or with 
> only the necessary changes. We should port them from Spark 3.3 to use the 
> latest implementation in Spark.
>  
> Classes copied:
> Spark33NestedSchemaPruning



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4465) Optimizing file-listing path in MT

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4465:
--
Story Points: 2  (was: 4)

> Optimizing file-listing path in MT
> --
>
> Key: HUDI-4465
> URL: https://issues.apache.org/jira/browse/HUDI-4465
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should review the file-listing path and try to optimize it as much as 
> possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4588) Ingestion failing if source column is dropped

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4588:
-
Story Points: 4  (was: 12)

> Ingestion failing if source column is dropped
> -
>
> Key: HUDI-4588
> URL: https://issues.apache.org/jira/browse/HUDI-4588
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Vamshi Gudavarthi
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available, schema, schema-evolution
> Fix For: 0.12.1
>
> Attachments: schema_stage1.avsc, schema_stage2.avsc, stage_1.json, 
> stage_2.json
>
>
> Ingestion using Deltastreamer fails if columns are dropped from the source. I 
> reproduced this using the docker-demo setup. Below are the steps to reproduce it.
>  # I created the data file `stage_1.json` (attached), ingested it to kafka, and 
> ingested it from kafka into a hudi table using a Deltastreamer job (using 
> FileschemaProvider with `schema_stage1.avsc`).
>  # The next step simulates dropping a column from the source.
>  # Repeat the steps from step 1 with the stage2 files. The stage2 files don't 
> have the `day` column, and the ingestion job failed. Below is the detailed 
> stacktrace.
> {code:java}
> Driver stacktrace:
>     at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
>     at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at scala.Option.foreach(Option.scala:257)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>     at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply$mcD$sp(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:34)
>     at org.apache.spark.api.java.JavaDoubleRDD.sum(JavaDoubleRDD.scala:165)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:607)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:335)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:201)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:199)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:557)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)

[jira] [Updated] (HUDI-4691) Deduplicate Spark 3.2 and Spark 3.3 integrations

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4691:
-
Story Points: 6  (was: 12)

> Deduplicate Spark 3.2 and Spark 3.3 integrations
> 
>
> Key: HUDI-4691
> URL: https://issues.apache.org/jira/browse/HUDI-4691
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While adding support for Spark 3.3, a considerable portion of the 
> version-specific integration was simply copied over from the Spark 3.2 one, 
> without deliberating whether this was required.
> We should address such duplication ASAP, to make sure that only the pieces 
> necessary to handle version-specific behavior are duplicated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4690) Remove code duplicated over from Spark

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4690:
--
Story Points: 12  (was: 5)

> Remove code duplicated over from Spark
> --
>
> Key: HUDI-4690
> URL: https://issues.apache.org/jira/browse/HUDI-4690
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> At present, a lot of code in `HoodieAnalysis` unnecessarily duplicates the 
> resolution logic from Spark, which interferes with the normal operation of 
> Spark's Analyzer and leads to non-trivial issues (like HUDI-4503) when 
> dealing with Spark or Spark SQL.
>  
> We should minimize the amount of logic and code localized from Spark into 
> Hudi to what is strictly necessary to either 
>  # Address issues (as an alternative to upstreaming in Spark)
>  # Back-port features (from newer Spark versions to older ones)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-2955) Upgrade Hadoop to 3.3.x

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2955:
-
Reviewers: Ethan Guo  (was: Alexey Kudinkin, Ethan Guo)

> Upgrade Hadoop to 3.3.x
> ---
>
> Key: HUDI-2955
> URL: https://issues.apache.org/jira/browse/HUDI-2955
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Rahil Chertara
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2021-12-07 at 2.32.51 PM.png
>
>
> According to Hadoop compatibility matrix, this is a pre-requisite to 
> upgrading to JDK11:
> !Screen Shot 2021-12-07 at 2.32.51 PM.png|width=938,height=230!
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions]
>  
> *Upgrading Hadoop from 2.x to 3.x*
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts]
> Everything (relevant to us) seems to be in good shape, except Spark 2.2/2.3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4584) SQLConf is not propagated correctly into RDDs

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4584:
--
Story Points: 6  (was: 8)

> SQLConf is not propagated correctly into RDDs
> -
>
> Key: HUDI-4584
> URL: https://issues.apache.org/jira/browse/HUDI-4584
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> There were a few reports, in Slack as well as in GitHub Issues, of Spark SQL 
> configs not being respected by DeltaStreamer while working perfectly fine 
> when leveraging the DataSource API:
> [https://github.com/apache/hudi/issues/6278]
>  
> I was able to trace these down to
>  # `HoodieSparkUtils.createRDD` instantiating `AvroSerializer`, which uses a 
> SQLConf that isn't propagated by Spark properly.
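
As context on the failure mode described above, here is a minimal sketch (not Hudi code; the object name and config choice are illustrative) of how a session-level SQL conf set on the driver may not be visible to `SQLConf.get` inside an RDD closure, which is exactly what bites any component (like `AvroSerializer`) constructed on the executor side:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

object SqlConfPropagationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sqlconf-propagation-sketch")
      .config("spark.sql.session.timeZone", "UTC")
      .getOrCreate()

    // Driver side: resolves against the active session, prints "UTC".
    println(SQLConf.get.sessionLocalTimeZone)

    // Executor side: for a plain RDD job (not wrapped in an SQLExecution),
    // session confs are not shipped as task-local properties, so SQLConf.get
    // may fall back to defaults (here, the JVM time zone) instead of "UTC".
    spark.sparkContext.parallelize(1 to 2)
      .map(_ => SQLConf.get.sessionLocalTimeZone)
      .collect()
      .foreach(println)

    // One workaround: read the needed values on the driver and close over
    // them, so the closure no longer depends on executor-side SQLConf lookup.
    val tz = spark.sessionState.conf.sessionLocalTimeZone
    spark.sparkContext.parallelize(1 to 2)
      .map(_ => tz)
      .collect()
      .foreach(println)

    spark.stop()
  }
}
{code}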



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4691) Deduplicate Spark 3.2 and Spark 3.3 integrations

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4691:
--
Story Points: 12  (was: 3)

> Deduplicate Spark 3.2 and Spark 3.3 integrations
> 
>
> Key: HUDI-4691
> URL: https://issues.apache.org/jira/browse/HUDI-4691
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: reader-core, writer-core
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
>
> While adding support for Spark 3.3, a considerable portion of the 
> version-specific integration was simply copied over from the Spark 3.2 one, 
> without deliberating whether each piece was actually required.
> We should address this duplication ASAP, to make sure that only the pieces 
> necessary to handle version-specific behavior are duplicated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4588) Ingestion failing if source column is dropped

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4588:
--
Story Points: 12  (was: 5)

> Ingestion failing if source column is dropped
> -
>
> Key: HUDI-4588
> URL: https://issues.apache.org/jira/browse/HUDI-4588
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>Reporter: Vamshi Gudavarthi
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available, schema, schema-evolution
> Fix For: 0.12.1
>
> Attachments: schema_stage1.avsc, schema_stage2.avsc, stage_1.json, 
> stage_2.json
>
>
> Ingestion using Deltastreamer fails if columns are dropped from the source. I 
> reproduced it using the docker-demo setup. Below are the steps to reproduce.
>  # Created the data file `stage_1.json` (attached), ingested it into Kafka, 
> and ingested it from Kafka into a Hudi table using a Deltastreamer job (with 
> FileschemaProvider and `schema_stage1.avsc`).
>  # The next step simulates a column being dropped from the source.
>  # Repeated step 1 with the stage2 files. The stage2 files don't have the 
> `day` column, and the ingestion job failed. The detailed stacktrace is below.
> {code:java}
> Driver stacktrace:
>     at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
>     at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at scala.Option.foreach(Option.scala:257)
>     at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>     at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>     at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply$mcD$sp(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at 
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:34)
>     at org.apache.spark.api.java.JavaDoubleRDD.sum(JavaDoubleRDD.scala:165)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:607)
>     at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:335)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:201)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:199)
>     at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:557)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at 

[jira] [Updated] (HUDI-4503) Support table identifier with explicit catalog

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4503:
--
Story Points: 4  (was: 2)

> Support table identifier with explicit catalog
> --
>
> Key: HUDI-4503
> URL: https://issues.apache.org/jira/browse/HUDI-4503
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>Reporter: Yann Byron
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4626) Partitioning table by `_hoodie_partition_path` fails

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4626:
--
Story Points: 4  (was: 2)

> Partitioning table by `_hoodie_partition_path` fails
> 
>
> Key: HUDI-4626
> URL: https://issues.apache.org/jira/browse/HUDI-4626
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.1
>
>
>  
> Currently, creating a table partitioned by "_hoodie_partition_path" using the 
> Glue catalog fails w/ the following exception:
> {code:java}
> AnalysisException: Found duplicate column(s) in the data schema and the 
> partition schema: _hoodie_partition_path
> {code}
> Using the following DDL:
> {code:java}
> CREATE EXTERNAL TABLE `active_storage_attachments`(
>   `_hoodie_commit_time` string COMMENT '',
>   `_hoodie_commit_seqno` string COMMENT '',
>   `_hoodie_record_key` string COMMENT '',
>   `_hoodie_file_name` string COMMENT '',
>   `_change_operation_type` string COMMENT '',
>   `_upstream_event_processed_ts_ms` bigint COMMENT '',
>   `db_shard_source_partition` string COMMENT '',
>   `_event_origin_ts_ms` bigint COMMENT '',
>   `_event_tx_id` bigint COMMENT '',
>   `_event_lsn` bigint COMMENT '',
>   `_event_xmin` bigint COMMENT '',
>   `id` bigint COMMENT '',
>   `name` string COMMENT '',
>   `record_type` string COMMENT '',
>   `record_id` bigint COMMENT '',
>   `blob_id` bigint COMMENT '',
>   `created_at` timestamp COMMENT '')
> PARTITIONED BY (
>   `_hoodie_partition_path` string COMMENT '')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'hoodie.query.as.ro.table'='false',
>   'path'='...')
> STORED AS INPUTFORMAT
>   'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   '...'
> TBLPROPERTIES (
>   'spark.sql.sources.provider'='hudi')
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4584) SQLConf is not propagated correctly into RDDs

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4584:
--
Story Points: 8  (was: 4)

> SQLConf is not propagated correctly into RDDs
> -
>
> Key: HUDI-4584
> URL: https://issues.apache.org/jira/browse/HUDI-4584
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> There were a few reports, in Slack as well as in GitHub Issues, of Spark SQL 
> configs not being respected by DeltaStreamer while working perfectly fine 
> when leveraging the DataSource API:
> [https://github.com/apache/hudi/issues/6278]
>  
> I was able to trace these down to
>  # `HoodieSparkUtils.createRDD` instantiating `AvroSerializer`, which uses a 
> SQLConf that isn't propagated by Spark properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4364) integrate column stats index with presto engine

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4364:
-
Sprint:   (was: 2022/08/22)

> integrate column stats index with presto engine
> ---
>
> Key: HUDI-4364
> URL: https://issues.apache.org/jira/browse/HUDI-4364
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata, reader-core
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3397) Make sure Spark RDDs triggering actual FS activity are only dereferenced once

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3397:
--
Sprint: 2022/09/05

> Make sure Spark RDDs triggering actual FS activity are only dereferenced once
> -
>
> Key: HUDI-3397
> URL: https://issues.apache.org/jira/browse/HUDI-3397
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: spark
> Fix For: 0.13.0
>
>
> Currently, RDD dereferencing is treated quite loosely: there are multiple 
> flows that dereference the same RDD (for example, through `collect`, `count`, 
> etc.), triggering the same operations to be carried out multiple times and 
> occasionally duplicating output already persisted on the FS.
> Check out HUDI-3370 for a recent example.
> NOTE: Even though Spark caching is supposed to make sure that we aren't 
> writing to the FS multiple times, we can't rely on caching alone to guarantee 
> exactly-once execution.
> Instead, we should make sure that RDDs are dereferenced only {*}once{*}, w/in 
> the "commit" operation, with all other operations relying only on 
> _derivative_ data.
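
A minimal sketch of the hazard and of the proposed discipline (not Hudi code; the accumulator stands in for an actual side effect such as persisting files to the FS):

{code:scala}
import org.apache.spark.sql.SparkSession

object DereferenceOnceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("dereference-once-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val sideEffects = sc.longAccumulator("side-effects")
    val writes = sc.parallelize(1 to 4).map { i =>
      sideEffects.add(1) // stand-in for "persist a file to the FS"
      i
    }

    // Anti-pattern: two dereferences re-run the lineage, so the "write"
    // executes 8 times for 4 records (absent caching).
    writes.count()
    writes.collect()
    println(s"side effects after count + collect: ${sideEffects.value}") // 8

    // The discipline from this ticket: dereference once, then derive
    // everything else (counts, metrics, etc.) from the materialized result.
    sideEffects.reset()
    val materialized = writes.collect() // the single dereference
    val total = materialized.length     // derivative data, no re-execution
    println(s"side effects after single collect: ${sideEffects.value} ($total records)") // 4

    spark.stop()
  }
}
{code}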



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4465) Optimizing file-listing path in MT

2022-08-23 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4465:
--
Sprint: 2022/08/22

> Optimizing file-listing path in MT
> --
>
> Key: HUDI-4465
> URL: https://issues.apache.org/jira/browse/HUDI-4465
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> We should review the file-listing path and optimize it as much as possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4467) Port borrowed code from Spark 3.3

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4467:
-
Story Points: 5

> Port borrowed code from Spark 3.3
> -
>
> Key: HUDI-4467
> URL: https://issues.apache.org/jira/browse/HUDI-4467
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Shawn Chang
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Currently some classes are copied from the Spark 3.2 module with or without 
> the necessary changes. We should port them from Spark 3.3 to use the latest 
> implementation in Spark.
>  
> Classes copied:
> Spark33NestedSchemaPruning



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4468) Simplify TimeTravel logic for Spark 3.3

2022-08-23 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4468:
-
Sprint: 2022/08/22  (was: 2022/09/19)

> Simplify TimeTravel logic for Spark 3.3
> ---
>
> Key: HUDI-4468
> URL: https://issues.apache.org/jira/browse/HUDI-4468
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Shawn Chang
>Assignee: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.1
>
>
> Existing Hudi relies on .g4 files and antlr classes to make time travel work 
> for Spark 3.2.
> Since time travel is natively supported on Spark 3.3, that logic can be 
> greatly simplified, and some of it can be removed entirely.
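
For illustration, a minimal sketch (assuming a Hudi table `hudi_tbl` already registered in the session catalog and a valid commit instant) of why the custom grammar becomes unnecessary on Spark 3.3:

{code:scala}
import org.apache.spark.sql.SparkSession

object TimeTravelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("time-travel-sketch")
      .getOrCreate()

    // Spark 3.3 parses TIMESTAMP AS OF / VERSION AS OF natively (SPARK-37219),
    // so no Hudi-supplied .g4 grammar or antlr extension is needed for this.
    spark.sql("SELECT * FROM hudi_tbl TIMESTAMP AS OF '2022-08-23 00:00:00'").show()

    // On Spark 3.2 the same clause only parses because Hudi ships its own
    // antlr-based parser extension, which is the logic this ticket simplifies.
    spark.stop()
  }
}
{code}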



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

