[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
hudi-bot commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1225193356 ## CI report: * 06f352b0235cbbac215174c2755fca24009799c5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10912) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on issue #4622: [SUPPORT] Can't query Redshift rows even after downgrade from 0.10
nsivabalan commented on issue #4622: URL: https://github.com/apache/hudi/issues/4622#issuecomment-1225182724 thanks @nochimow for the update. appreciate it.
[GitHub] [hudi] fengjian428 commented on issue #6441: Status on PR: 2666: Support update partial fields for CoW table
fengjian428 commented on issue #6441: URL: https://github.com/apache/hudi/issues/6441#issuecomment-1225175998

> What I understand: OverwriteNonDefaultsWithLatestAvroPayload can update the non-null fields in the new data (CDC) onto the old data (Hudi table). But if I have multiple changes for the same record key in the new CDC data, it won't give me the correct output.
>
> For example, Hudi table: RK1, F1, F2, F3, F4, F5
>
> New CDC data:
> RK1, null, null, F3', null, F5'
> RK1, F1', null, F3", null, null
> RK1, null, F2', null, F4', F5"
>
> So the expected output for record key RK1 in the Hudi table would be: RK1, F1', F2', F3", F4', F5"
>
> Is there any future plan to merge the following work into Hudi master, which could help us get partial updates? #2666

Try turning off `hoodie.combine.before.upsert`?
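For illustration, the merge semantics being discussed (each incoming change overwriting only its non-null fields, applied change by change when `hoodie.combine.before.upsert` is disabled) can be sketched as follows. This is a simplified model for the example in the quote above, not Hudi's actual `OverwriteNonDefaultsWithLatestAvroPayload` implementation:

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch of partial-update merging: a null field in the
// incoming change means "field not touched"; every non-null field
// overwrites the stored value. Applying the changes one by one (i.e.
// without pre-combining them) reproduces the expected output above.
public class PartialMergeSketch {
    public static String[] merge(String[] stored, String[] change) {
        String[] out = stored.clone();
        for (int i = 0; i < change.length; i++) {
            if (change[i] != null) {
                out[i] = change[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] row = {"F1", "F2", "F3", "F4", "F5"};
        List<String[]> cdc = Arrays.asList(
            new String[]{null, null, "F3'", null, "F5'"},
            new String[]{"F1'", null, "F3\"", null, null},
            new String[]{null, "F2'", null, "F4'", "F5\""});
        for (String[] change : cdc) {
            row = merge(row, change);
        }
        System.out.println(String.join(",", row)); // F1',F2',F3",F4',F5"
    }
}
```

If the three changes were pre-combined down to one record before the upsert, only the surviving record's non-null fields would be applied, which is why disabling pre-combine matters for this scenario.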
[GitHub] [hudi] brskiran1 commented on issue #6304: Hudi MultiTable Deltastreamer not updating glue catalog when new column added on Source
brskiran1 commented on issue #6304: URL: https://github.com/apache/hudi/issues/6304#issuecomment-1225173640 @rmahindra123 responding on behalf of @SubashRanganathan. We have also tried this without the flag hoodie.schema.on.read.enable set to true. We still don't see the glue catalog updated with the new column.
[jira] [Commented] (HUDI-4698) Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core
[ https://issues.apache.org/jira/browse/HUDI-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583983#comment-17583983 ] Danny Chen commented on HUDI-4698: -- Fixed via master branch: 822c1397e04936b89fda771bb1c269de5fb0dd4b > Rename the package 'org.apache.flink.table.data' to avoid conflicts with > flink table core > - > > Key: HUDI-4698 > URL: https://issues.apache.org/jira/browse/HUDI-4698 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-4698) Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core
[ https://issues.apache.org/jira/browse/HUDI-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen resolved HUDI-4698.
[hudi] branch master updated (16a80e6d41 -> 822c1397e0)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 16a80e6d41 [HUDI-4637] Release thread in RateLimiter doesn't been terminated (#6433) add 822c1397e0 [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid conflicts with flink table core (#6481) No new revisions were added by this update. Summary of changes: .../{flink => hudi}/table/data/ColumnarArrayData.java| 16 .../{flink => hudi}/table/data/ColumnarMapData.java | 4 +++- .../{flink => hudi}/table/data/ColumnarRowData.java | 12 ++-- .../table/data/vector/MapColumnVector.java | 3 ++- .../table/data/vector/RowColumnVector.java | 6 -- .../table/data/vector/VectorizedColumnBatch.java | 14 +- .../hudi/table/format/cow/ParquetSplitReaderUtil.java| 2 +- .../hudi/table/format/cow/vector/HeapArrayVector.java| 3 ++- .../table/format/cow/vector/HeapMapColumnVector.java | 5 +++-- .../table/format/cow/vector/HeapRowColumnVector.java | 7 --- .../format/cow/vector/reader/ArrayColumnReader.java | 2 +- .../cow/vector/reader/ParquetColumnarRowSplitReader.java | 4 ++-- 12 files changed, 57 insertions(+), 21 deletions(-) rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/ColumnarArrayData.java (93%) rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/ColumnarMapData.java (94%) rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/ColumnarRowData.java (93%) rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/vector/MapColumnVector.java (90%) rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/vector/RowColumnVector.java (85%) rename hudi-flink-datasource/hudi-flink1.13.x/src/main/java/org/apache/{flink => hudi}/table/data/vector/VectorizedColumnBatch.java (84%)
[GitHub] [hudi] namuny commented on issue #6212: [SUPPORT] Hudi creates duplicate, redundant file during clustering
namuny commented on issue #6212: URL: https://github.com/apache/hudi/issues/6212#issuecomment-1225170438 Gentle bump to see if anyone has any further recommendations on what information we could provide to help with reproducing the issue.
[GitHub] [hudi] danny0405 merged pull request #6481: [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid…
danny0405 merged PR #6481: URL: https://github.com/apache/hudi/pull/6481
[GitHub] [hudi] danny0405 commented on pull request #6481: [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid…
danny0405 commented on PR #6481: URL: https://github.com/apache/hudi/pull/6481#issuecomment-1225170342 The failed test case is flaky and should not be caused by this patch; I will merge this PR and fix the test in another PR.
[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
hudi-bot commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1225160370

## CI report:

* b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10892)
* 06f352b0235cbbac215174c2755fca24009799c5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10912)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
hudi-bot commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1225157843

## CI report:

* b54e1a1397b1294cc4dc6e28bdfea7fb4ccaceab Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10892)
* 06f352b0235cbbac215174c2755fca24009799c5 UNKNOWN

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] TengHuo commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
TengHuo commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1225157773 Done. Updated the method `parseDateFromInstantTimeSafely`: it now logs a warning message and returns `Option.empty` when it gets an invalid timestamp, so no metrics are output when the timestamp is invalid. Also rebased the code onto the latest master.
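The behaviour described in this update can be sketched roughly as follows. This is a minimal illustration only: it uses `java.util.Optional` and `SimpleDateFormat` in place of Hudi's own `Option` type and `HoodieInstantTimeGenerator`, and assumes the seconds-granularity instant pattern `yyyyMMddHHmmss`:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Optional;

// Sketch of an Option-returning "safe" instant parser: instead of
// returning a dirty sentinel like Date(0) on failure, return empty so
// callers (e.g. metrics reporting) can simply skip invalid timestamps.
public class SafeInstantParse {
    // SimpleDateFormat is not thread-safe; a single-threaded sketch only.
    private static final SimpleDateFormat FORMAT = new SimpleDateFormat("yyyyMMddHHmmss");

    public static Optional<Date> parseDateFromInstantTimeSafely(String timestamp) {
        try {
            FORMAT.setLenient(false);
            return Optional.of(FORMAT.parse(timestamp));
        } catch (ParseException e) {
            // The real method logs a warning here; no metric is emitted.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDateFromInstantTimeSafely("20220824093000").isPresent()); // true
    }
}
```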
[GitHub] [hudi] Zouxxyy commented on issue #6479: [SUPPORT] How to query the previous SNAPSHOT in Hive
Zouxxyy commented on issue #6479: URL: https://github.com/apache/hudi/issues/6479#issuecomment-1225129309 I guess it's still under development.
[GitHub] [hudi] TengHuo commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
TengHuo commented on code in PR #6000: URL: https://github.com/apache/hudi/pull/6000#discussion_r953286608

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java: @@ -75,16 +75,56 @@

    private static final Set<String> NOT_PARSABLE_TIMESTAMPS = new HashSet<String>(3) {{
        add(HoodieTimeline.INIT_INSTANT_TS);
        add(HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS);
        add(HoodieTimeline.FULL_BOOTSTRAP_INSTANT_TS);
    }};

    /**
     * The same format method as above, but this method mutes ParseException
     * if the given timestamp is invalid, returning Date(0), or a corresponding Date if the timestamp is one of
     * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#INIT_INSTANT_TS},
     * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#METADATA_BOOTSTRAP_INSTANT_TS},
     * {@link org.apache.hudi.common.table.timeline.HoodieTimeline#FULL_BOOTSTRAP_INSTANT_TS}.
     * This method is useful when parsing timestamps for metrics.
     *
     * @param timestamp a timestamp String following the pattern
     *                  {@link org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator#SECS_INSTANT_TIMESTAMP_FORMAT}.
     * @return Date of instant timestamp
     */
    public static Date parseDateFromInstantTimeSafely(String timestamp) {
      Date parsedDate;
      try {
        parsedDate = HoodieInstantTimeGenerator.parseDateFromInstantTime(timestamp);
      } catch (ParseException e) {
        LOG.warn("Failed to parse timestamp " + timestamp + " because of " + e.getMessage());
        if (NOT_PARSABLE_TIMESTAMPS.contains(timestamp)) {
          parsedDate = new Date(Integer.parseInt(timestamp));
        } else {
          parsedDate = new Date(0);

Review Comment (on `parsedDate = new Date(0);`): It's the old logic in `HoodieInstantTimeGenerator.parseDateFromInstantTime`: if it catches the error and the timestamp is all zeros, it returns `Date(0)`, so I kept it.

```java
// Special handling for all zero timestamp which is not parsable by DateTimeFormatter
if (timestamp.equals(ALL_ZERO_TIMESTAMP)) {
    return new Date(0);
}
throw e;
```

But I agree with you: it returns a dirty value, which is bad for the code that uses this method. `parseDateFromInstantTimeSafely` should return an optional value, so the code that uses this method can decide how to deal with `Option.empty`.
[GitHub] [hudi] danny0405 commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
danny0405 commented on code in PR #6000: URL: https://github.com/apache/hudi/pull/6000#discussion_r953281948

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java: (same hunk as above, on the line `parsedDate = new Date(0);`)

Review Comment: What is the meaning of reporting a `Date(0)`? I think that is dirty data for metrics; instead, we should not report any metrics at all in this case.
[GitHub] [hudi] TengHuo commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
TengHuo commented on code in PR #6000: URL: https://github.com/apache/hudi/pull/6000#discussion_r953269114

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java: (same hunk as above, on the line `LOG.warn("Failed to parse timestamp " + timestamp + " because of " + e.getMessage());`)

Review Comment: Got it, no problem. Let me rebase my code.
[jira] [Closed] (HUDI-4637) Release thread in RateLimiter is not terminated
[ https://issues.apache.org/jira/browse/HUDI-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-4637. Resolution: Fixed

> Release thread in RateLimiter is not terminated
> Key: HUDI-4637
> URL: https://issues.apache.org/jira/browse/HUDI-4637
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Reporter: xi chaomin
> Assignee: xi chaomin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.12.1
>
> When I use the hbase index, I find the job can't finish. I set the log level to DEBUG and see endless printing:
> {code:java}
> 22/08/17 18:26:45 DEBUG RateLimiter: Release permits: maxPremits: 100, available: 100
> 22/08/17 18:26:45 DEBUG RateLimiter: Release permits: maxPremits: 1000, available: 1000
> {code}
[hudi] branch master updated (ca8a57a21d -> 16a80e6d41)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from ca8a57a21d [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy (#6267) add 16a80e6d41 [HUDI-4637] Release thread in RateLimiter doesn't been terminated (#6433) No new revisions were added by this update. Summary of changes: .../org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java| 11 --- 1 file changed, 8 insertions(+), 3 deletions(-)
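The commit above touches `SparkHoodieHBaseIndex`. As a general illustration of this kind of fix (a sketch under assumptions, not the actual Hudi change), a rate limiter whose background release thread keeps the JVM alive can be handled by making that thread a daemon and exposing an explicit shutdown:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Sketch: a permit-refilling rate limiter whose release thread is a
// daemon (so it cannot block JVM exit) and which can be shut down
// explicitly once the caller is done with it.
public class ShutdownableRateLimiter {
    private final Semaphore permits;
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "rate-limiter-release");
            t.setDaemon(true); // do not keep the JVM alive
            return t;
        });

    public ShutdownableRateLimiter(int maxPermits) {
        this.permits = new Semaphore(maxPermits);
        // Periodically top the pool back up to maxPermits.
        scheduler.scheduleAtFixedRate(
            () -> permits.release(maxPermits - permits.availablePermits()),
            1, 1, TimeUnit.SECONDS);
    }

    public void acquire() {
        permits.acquireUninterruptibly();
    }

    public void shutdown() {
        // Explicit termination of the release thread.
        scheduler.shutdownNow();
    }

    public boolean isShutdown() {
        return scheduler.isShutdown();
    }
}
```

Without either the daemon flag or the `shutdown()` call, the scheduler's non-daemon worker thread would keep running its release loop forever, which matches the "job can't finish" symptom reported in the issue.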
[GitHub] [hudi] nsivabalan merged pull request #6433: [HUDI-4637] Release thread in RateLimiter is not terminated
nsivabalan merged PR #6433: URL: https://github.com/apache/hudi/pull/6433
[GitHub] [hudi] linfey90 commented on a diff in pull request #6456: [HUDI-4674]Change the default value of inputFormat for the MOR table
linfey90 commented on code in PR #6456: URL: https://github.com/apache/hudi/pull/6456#discussion_r953253259

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala: @@ -120,10 +119,8 @@

    object CreateHoodieTableCommand {
      val tableType = tableConfig.getTableType.name()
      val inputFormat = tableType match {
    -   case DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL =>
    +   case DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL | DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL =>

Review Comment: In Hive queries, the original table name is used rather than the _rt/_ro suffixed table names; at this point we choose to skip the _ro table. I also think Hive offline tasks should use the read-optimized table, so its inputFormat default value should be HoodieParquetInputFormat. If there are other considerations behind this default value, I will compare and modify it when using Hive metadata sync.
[GitHub] [hudi] nsivabalan commented on pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool
nsivabalan commented on PR #5920: URL: https://github.com/apache/hudi/pull/5920#issuecomment-1225077158 Hey @kk17: are there any updates on this patch? Once it's ready, let me know and I can take another look.
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
nsivabalan commented on code in PR #6000: URL: https://github.com/apache/hudi/pull/6000#discussion_r953246798

## hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java: (same hunk as above, on the line `LOG.warn("Failed to parse timestamp " + timestamp + " because of " + e.getMessage());`)

Review Comment: Can we move this warn message into the else block?
[jira] [Closed] (HUDI-4515) savepoints will be clean in keeping latest versions policy
[ https://issues.apache.org/jira/browse/HUDI-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan closed HUDI-4515. - Resolution: Fixed > savepoints will be clean in keeping latest versions policy > -- > > Key: HUDI-4515 > URL: https://issues.apache.org/jira/browse/HUDI-4515 > Project: Apache Hudi > Issue Type: Bug > Components: cleaning >Affects Versions: 0.11.1 >Reporter: zouxxyy >Assignee: zouxxyy >Priority: Blocker > Labels: bug, clean, pull-request-available, savepoints > Fix For: 0.12.1 > > > When I tested the behavior of clean and savepoint, I found that when clean is > keeping latest versions, the files of savepoint will be deleted. By reading > the code, I found that this should be a bug > > For example, if I use "HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS", and > set the “hoodie.cleaner.fileversions.retained” to 2, I do the following: > 1. insert, get _001.parquet > 2. savepoint > 3. insert, get _002.parquet > 4. insert, get _003.parquet > After the fourth step, the _001.parquet will be deleted even if it > belongs to savepoint ! > > here is: > hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java: > getFilesToCleanKeepingLatestVersions > * According to the following code, on the one hand, the checkpoints > belonging to keepversion will be skipped and will not be counted in the > calculation of keepversion, which I feel is unreasonable. > * On the other hand, if there is a checkpoint in the remaining version of > the files, it will be deleted, which I don't think is in line with the design > philosophy of savepoints. 
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }
> {code}
>
> So I think the judgment logic for the savepoint should be moved down; it can be fixed like this:
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   fileSliceIterator.next();
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }
> {code}
>
> Thanks.
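The corrected logic quoted in the issue can be demonstrated with a small runnable sketch. File slices are modeled as plain strings ordered newest first, which is an assumption for illustration; the real `CleanPlanner` works on `FileSlice`/`HoodieBaseFile` objects:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of KEEP_LATEST_FILE_VERSIONS with the fix applied: count the
// latest `keepVersions` slices unconditionally first, then delete the
// remainder while skipping any savepointed file.
public class CleanPlanSketch {
    public static List<String> filesToDelete(List<String> slicesNewestFirst,
                                             Set<String> savepointed,
                                             int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        int i = 0;
        // Skip the most recent versions regardless of savepoint status.
        for (; i < slicesNewestFirst.size() && keepVersions > 0; i++) {
            keepVersions--;
        }
        // Delete the remaining files, but never a savepointed one.
        for (; i < slicesNewestFirst.size(); i++) {
            String file = slicesNewestFirst.get(i);
            if (!savepointed.contains(file)) {
                deletePaths.add(file);
            }
        }
        return deletePaths;
    }

    public static void main(String[] args) {
        // The scenario from the issue: 3 versions, retain 2,
        // _001.parquet is savepointed and must survive.
        List<String> slices = Arrays.asList("_003.parquet", "_002.parquet", "_001.parquet");
        Set<String> savepointed = new HashSet<>(Arrays.asList("_001.parquet"));
        System.out.println(filesToDelete(slices, savepointed, 2)); // []
    }
}
```

With the buggy ordering, the savepoint check consumed iterator positions inside the "keep" loop, so a savepointed file could end up in the delete pass; moving the check into the delete pass keeps the retention count and the savepoint protection independent.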
[GitHub] [hudi] nsivabalan merged pull request #6267: [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy
nsivabalan merged PR #6267: URL: https://github.com/apache/hudi/pull/6267 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated (1879efa45d -> ca8a57a21d)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from 1879efa45d [HUDI-4686] Flip option 'write.ignore.failed' to default false (#6467) add ca8a57a21d [HUDI-4515] Fix savepoints will be cleaned in keeping latest versions policy (#6267) No new revisions were added by this update. Summary of changes: .../hudi/table/action/clean/CleanPlanner.java | 10 +-- .../org/apache/hudi/client/TestClientRollback.java | 98 ++ 2 files changed, 103 insertions(+), 5 deletions(-)
[GitHub] [hudi] nsivabalan commented on pull request #6157: [HUDI-4431] Fix log file will not roll over to a new file
nsivabalan commented on PR #6157: URL: https://github.com/apache/hudi/pull/6157#issuecomment-1225068635 @XuQianJin-Stars : hey, can you follow up on this? Do we need a fix, or is it already taken care of? Let us know and we can close it out.
[GitHub] [hudi] bhasudha commented on pull request #6482: [DOCS] Add youtube channel and Office hours page
bhasudha commented on PR #6482: URL: https://github.com/apache/hudi/pull/6482#issuecomment-1225060413
**Image of the header** https://user-images.githubusercontent.com/2179254/186296081-1401a649-663e-4db0-9c67-5aef18ff6042.png
The logo is updated but is not usually visible in local website deployment. That's why you see an icon.
**Image of footer** https://user-images.githubusercontent.com/2179254/186296096-288d4329-2c22-4966-9616-5df52bbc8265.png
You can verify the link as well.
**Image of weekly office hours page** https://user-images.githubusercontent.com/2179254/186296219-c02372b9-b72b-4985-b51a-89feba287f1b.png
**Image of office hours in drop down** https://user-images.githubusercontent.com/2179254/186296640-85aee702-ed65-42d9-b1a3-31365789ee01.png
[GitHub] [hudi] bhasudha opened a new pull request, #6482: [DOCS] Add youtube channel and Office hours page
bhasudha opened a new pull request, #6482: URL: https://github.com/apache/hudi/pull/6482 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource
hudi-bot commented on PR #6135: URL: https://github.com/apache/hudi/pull/6135#issuecomment-1225016353 ## CI report: * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN * f70abbc3b45005d40e74252814edc0078a50030e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10909) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile
hudi-bot commented on PR #6105: URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224985799 ## CI report: * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN * 326f8f69ea423a58df8c98f382528efb9424d053 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10908)
[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource
hudi-bot commented on PR #6135: URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224982778 ## CI report: * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN * 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10905) * f70abbc3b45005d40e74252814edc0078a50030e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10909)
[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile
hudi-bot commented on PR #6105: URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224982712 ## CI report: * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10907) * 326f8f69ea423a58df8c98f382528efb9424d053 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10908)
[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource
hudi-bot commented on PR #6135: URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224979442 ## CI report: * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN * 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10905) * f70abbc3b45005d40e74252814edc0078a50030e UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile
hudi-bot commented on PR #6105: URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224979379 ## CI report: * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10907) * 326f8f69ea423a58df8c98f382528efb9424d053 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6432: [HUDI-4586] Improve metadata fetching in bloom index
hudi-bot commented on PR #6432: URL: https://github.com/apache/hudi/pull/6432#issuecomment-1224976328 ## CI report: * ed15f57dc58b2e9142dd33a0ecd078bf4c236afc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10887)
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6352: [HUDI-4584] Fixing `SQLConf` not being propagated to executor
alexeykudinkin commented on code in PR #6352: URL: https://github.com/apache/hudi/pull/6352#discussion_r953174810

## hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/SQLConfInjectingRDD.scala ##

@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.{Partition, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.internal.SQLConf
+
+import scala.reflect.ClassTag
+
+/**
+ * NOTE: This is a generalized version of Spark's [[SQLExecutionRDD]]
+ *
+ * It is just a wrapper over [[sqlRDD]] which sets and makes effective all the configs from the
+ * captured [[SQLConf]]
+ *
+ * @param sqlRDD the `RDD` generated by the SQL plan
+ * @param conf the `SQLConf` to apply to the execution of the SQL plan
+ */
+class SQLConfInjectingRDD[T: ClassTag](var sqlRDD: RDD[T], @transient conf: SQLConf) extends RDD[T](sqlRDD) {
+  private val sqlConfigs = conf.getAllConfs
+  private lazy val sqlConfExecutorSide = {
+    val newConf = new SQLConf()
+    sqlConfigs.foreach { case (k, v) => newConf.setConfString(k, v) }
+    newConf
+  }
+
+  override val partitioner = firstParent[InternalRow].partitioner
+
+  override def getPartitions: Array[Partition] = firstParent[InternalRow].partitions
+
+  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
+    // If we are in the context of a tracked SQL operation, `SQLExecution.EXECUTION_ID_KEY` is set
+    // and we have nothing to do here. Otherwise, we use the `SQLConf` captured at the creation of
+    // this RDD.
+    if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
+      SQLConf.withExistingConf(sqlConfExecutorSide) {

Review Comment: Yes, it will propagate to all RDDs in the execution chain (up to a shuffling point)
[GitHub] [hudi] yihua commented on a diff in pull request #6352: [HUDI-4584] Fixing `SQLConf` not being propagated to executor
yihua commented on code in PR #6352: URL: https://github.com/apache/hudi/pull/6352#discussion_r953160525 ## hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/SQLConfInjectingRDD.scala ## (same diff as quoted above) Review Comment: @alexeykudinkin I was asking the latter.
[GitHub] [hudi] hudi-bot commented on pull request #6432: [HUDI-4586] Improve metadata fetching in bloom index
hudi-bot commented on PR #6432: URL: https://github.com/apache/hudi/pull/6432#issuecomment-1224932396 ## CI report: * ed15f57dc58b2e9142dd33a0ecd078bf4c236afc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10887)
[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile
hudi-bot commented on PR #6105: URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224931869 ## CI report: * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10907)
[GitHub] [hudi] hudi-bot commented on pull request #6432: [HUDI-4586] Improve metadata fetching in bloom index
hudi-bot commented on PR #6432: URL: https://github.com/apache/hudi/pull/6432#issuecomment-1224927464 ## CI report: * ed15f57dc58b2e9142dd33a0ecd078bf4c236afc UNKNOWN
[GitHub] [hudi] dyang108 commented on issue #6428: [SUPPORT] S3 Deltastreamer: Block has already been inflated
dyang108 commented on issue #6428: URL: https://github.com/apache/hudi/issues/6428#issuecomment-1224925928 Update: I got it working on an older version of Hudi, 0.10.1, so this seems like a regression
[GitHub] [hudi] nikspatel03 commented on issue #6441: Status on PR: 2666: Support update partial fields for CoW table
nikspatel03 commented on issue #6441: URL: https://github.com/apache/hudi/issues/6441#issuecomment-1224887332 What I understand -> OverwriteNonDefaultsWithLatestAvroPayload can update the non-null fields in the new data (cdc) to the old data (Hudi table). But what if I have multiple changes for the same Record key in the new cdc data? Then it won't give me the correct output. For example: Hudi Table: RK1, F1, F2, F3, F4, F5 New cdc data: RK1, null, null, F3', null, F5' RK1, F1', null, F3", null, null RK1, null, F2', null, F4', F5" So the expected output of the Record key (RK1) row in the Hudi Table would be: RK1, F1', F2', F3", F4', F5" Is there any future plan to merge the following work into Hudi master, which can help us get partial updates? https://github.com/apache/hudi/pull/2666
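As a self-contained illustration of the merge semantics in question (hypothetical class; Hudi's `OverwriteNonDefaultsWithLatestAvroPayload` operates on Avro records, not string arrays), applying the three CDC rows for the same key one at a time does produce the expected row:

```java
import java.util.Arrays;

// Sketch of "non-null incoming field wins" merging: each CDC row overwrites
// only the fields it carries, applied in arrival order.
class PartialMergeSketch {
  static String[] merge(String[] current, String[] incoming) {
    String[] out = current.clone();
    for (int i = 0; i < incoming.length; i++) {
      if (incoming[i] != null) {
        out[i] = incoming[i];   // non-null incoming field overwrites
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String[] row = {"F1", "F2", "F3", "F4", "F5"};
    row = merge(row, new String[]{null, null, "F3'", null, "F5'"});
    row = merge(row, new String[]{"F1'", null, "F3\"", null, null});
    row = merge(row, new String[]{null, "F2'", null, "F4'", "F5\""});
    System.out.println(Arrays.toString(row)); // [F1', F2', F3", F4', F5"]
  }
}
```

This only works if all three rows actually reach the merge, which is presumably why turning off `hoodie.combine.before.upsert` is suggested in this thread: if pre-combine first collapses the three rows into a single latest row, the non-null values of the other rows are never applied.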
[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile
hudi-bot commented on PR #6105: URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224873599 ## CI report: * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN * 35c07f36c6409d471e1810833cec0b27cbf78cf9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10906) * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10907)
[GitHub] [hudi] nochimow closed issue #4622: [SUPPORT] Can't query Redshift rows even after downgrade from 0.10
nochimow closed issue #4622: [SUPPORT] Can't query Redshift rows even after downgrade from 0.10 URL: https://github.com/apache/hudi/issues/4622
[GitHub] [hudi] nochimow commented on issue #4622: [SUPPORT] Can't query Redshift rows even after downgrade from 0.10
nochimow commented on issue #4622: URL: https://github.com/apache/hudi/issues/4622#issuecomment-1224870906 Even with AWS saying that only 0.10.0 is "supported", I did some compatibility tests with Hudi 0.10, 0.11, and 0.12. All versions worked fine, which wasn't the case before. (Prior to that, any table with Hudi version > 0.9 was returning 0 rows on Redshift Spectrum.) The only detail here is that the Redshift version must be at patch >= 169. (Got this requirement from AWS support.)
[GitHub] [hudi] nsivabalan commented on issue #6474: [SUPPORT] Hudi Deltastreamer fails to acquire lock with DynamoDB Lock Provider.
nsivabalan commented on issue #6474: URL: https://github.com/apache/hudi/issues/6474#issuecomment-1224816176 yeah. From what I see, the cleaner waits for the lock (which was acquired to apply `20220822020402958` to the metadata table), but after retrying and before giving up, the cleaner releases the lock, which should not happen since it never acquired it. We made a fix in 0.11.1 to avoid a non-owner releasing the lock [here](https://github.com/apache/hudi/pull/5255), but it looks like there is more to be looked into.
```
02:06:31: lock acquired by 20220822020402958__deltacommit__INFLIGHT in the MDT.
02:06:46: clean is attempted on the data table (async cleaner).
02:06:48: clean tries to acquire the lock.
22/08/22 02:06:48 INFO org.apache.hudi.client.transaction.TransactionManager: Transaction starting for Optional.empty with latest completed transaction instant Optional.empty
22/08/22 02:06:48 INFO org.apache.hudi.client.transaction.lock.LockManager: LockProvider org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
02:07:47: after checking for compaction, a new delta commit starts in the MDT.
02:07:50: the delta commit state moves to completed.
02:08:22: a new delta commit starts (regular writer).
02:08:59: tries to acquire lock.
02:11:10: tries to acquire lock.
02:13:21: tries to acquire lock.
02:15:32: tries to acquire lock.
02:17:43: tries to acquire lock.
02:19:53: tries to acquire lock.
02:22:04: tries to acquire lock.
02:24:15: tries to acquire lock.
02:26:25: tries to acquire lock.
02:28:36: tries to acquire lock.
02:30:47: INFO org.apache.hudi.client.transaction.TransactionManager: Transaction ending with transaction owner Optional.empty
22/08/22 02:30:47 INFO org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASING lock at DynamoDb table = HudiLocker, partition key = process
22/08/22 02:30:47 INFO org.apache.hudi.client.transaction.TransactionManager: Transaction ended with transaction owner Optional.empty
25 mins so far from the time clean tried to acquire the lock.
Clean fails since it could not acquire the lock.
22/08/22 02:31:00: the original owner who acquired the lock is releasing it now.
22/08/22 02:31:00 INFO org.apache.hudi.client.transaction.TransactionManager: Transaction ending with transaction owner Option{val=[==>20220822020402958__deltacommit__INFLIGHT]}
22/08/22 02:31:00 INFO org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASING lock at DynamoDb table = HudiLocker, partition key = process
22/08/22 02:31:00 INFO org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider: RELEASED lock at DynamoDb table = HudiLocker, partition key = process
22/08/22 02:31:00 INFO org.apache.hudi.client.transaction.TransactionManager: Transaction ended with transaction owner Option{val=[==>20220822020402958__deltacommit__INFLIGHT]}
```
I might need to spend some more time to put in a fix for this.
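The unlock-by-non-owner problem in the timeline above can be reduced to a small standalone sketch (using a plain `ReentrantLock` in place of Hudi's `TransactionManager` and the DynamoDB lock provider):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// A retry loop that gives up cleanly: the caller only ever unlocks when
// tryAcquire returned true, so a non-owner can never release someone
// else's lock (the failure mode seen in the log above).
class LockRetrySketch {
  static boolean tryAcquire(ReentrantLock lock, int retries, long waitMs) {
    for (int i = 0; i < retries; i++) {
      try {
        if (lock.tryLock(waitMs, TimeUnit.MILLISECONDS)) {
          return true;   // acquired: this caller now owns the lock
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // preserve interrupt status
        return false;
      }
    }
    return false;        // NOT acquired: caller must not call unlock()
  }
}
```

The point is in the failure path: when `tryAcquire` returns false, the caller simply aborts the clean without touching the lock, so the holder's lock state is never clobbered.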
[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark 3.2 the default profile
hudi-bot commented on PR #6105: URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224756201 ## CI report: * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN * 35c07f36c6409d471e1810833cec0b27cbf78cf9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10906) * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark3 the default profile
hudi-bot commented on PR #6105: URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224734575 ## CI report: * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN * 58aadea50328122e1a9a1b01d38e3af12e33fbe1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9947) * 35c07f36c6409d471e1810833cec0b27cbf78cf9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10906) * 269aef1e346d379cdb5b76eb2aab9fc2945dcfc9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6105: Make Spark3 the default profile
hudi-bot commented on PR #6105: URL: https://github.com/apache/hudi/pull/6105#issuecomment-1224723686 ## CI report: * ec2ecf42597af2586cd3864b297f15b881cf204d UNKNOWN * 58aadea50328122e1a9a1b01d38e3af12e33fbe1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9947) * 35c07f36c6409d471e1810833cec0b27cbf78cf9 UNKNOWN
[GitHub] [hudi] minihippo commented on pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool
minihippo commented on PR #5920: URL: https://github.com/apache/hudi/pull/5920#issuecomment-1224555479 > > can we please write a test for the changes made. > > any instruction on how to write a test? Hi @kk17, you can refer to the unit tests in `TestHiveSyncTool`: mock a table written with 0.8, and call the new sync function to verify that Hive sync succeeds on 0.11.
[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource
hudi-bot commented on PR #6135: URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224544985 ## CI report: * d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN * 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10905)
[jira] [Updated] (HUDI-2369) Blog on bulk insert sort modes
[ https://issues.apache.org/jira/browse/HUDI-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-2369: -- Sprint: 2022/09/05 > Blog on bulk insert sort modes > -- > > Key: HUDI-2369 > URL: https://issues.apache.org/jira/browse/HUDI-2369 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > Blog on bulk insert sort modes
[jira] [Updated] (HUDI-2369) Blog on bulk insert sort modes
[ https://issues.apache.org/jira/browse/HUDI-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-2369: -- Fix Version/s: 0.12.1 > Blog on bulk insert sort modes > -- > > Key: HUDI-2369 > URL: https://issues.apache.org/jira/browse/HUDI-2369 > Project: Apache Hudi > Issue Type: Task > Components: docs >Reporter: sivabalan narayanan >Assignee: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > Blog on bulk insert sort modes
[GitHub] [hudi] yihua commented on pull request #6442: [HUDI-4449] Support DataSourceV2 Read for Spark3.2
yihua commented on PR #6442: URL: https://github.com/apache/hudi/pull/6442#issuecomment-1224494779

@alexeykudinkin FYI
[jira] [Updated] (HUDI-4496) ORC fails w/ Spark 3.1
[ https://issues.apache.org/jira/browse/HUDI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4496:
--
    Fix Version/s: 13.0

> ORC fails w/ Spark 3.1
> --
>
> Key: HUDI-4496
> URL: https://issues.apache.org/jira/browse/HUDI-4496
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.12.0
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 13.0
>
> After running TestHoodieSparkSqlWriter test for different Spark versions,
> discovered that Orc version was incorrectly put as compile time dep on the
> classpath, breaking Orc writing in Hudi in Spark 3.1:
> https://github.com/apache/hudi/runs/7567326789?check_suite_focus=true
[jira] [Updated] (HUDI-4389) Make HoodieStreamingSink idempotent
[ https://issues.apache.org/jira/browse/HUDI-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4389:
--
    Sprint: 2022/08/22 (was: 2022/09/19)

> Make HoodieStreamingSink idempotent
> --
>
> Key: HUDI-4389
> URL: https://issues.apache.org/jira/browse/HUDI-4389
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Sagar Sumit
> Assignee: Sagar Sumit
> Priority: Blocker
> Labels: pull-request-available, streaming
> Fix For: 0.13.0
>
[jira] [Updated] (HUDI-2673) Add integration/e2e test for kafka-connect functionality
[ https://issues.apache.org/jira/browse/HUDI-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-2673:
--
    Sprint: Hudi-Sprint-Apr-19, Hudi-Sprint-Apr-25, 2022/05/02, 2022/05/16, 2022/08/22 (was: Hudi-Sprint-Apr-19, Hudi-Sprint-Apr-25, 2022/05/02, 2022/05/16)

> Add integration/e2e test for kafka-connect functionality
> --
>
> Key: HUDI-2673
> URL: https://issues.apache.org/jira/browse/HUDI-2673
> Project: Apache Hudi
> Issue Type: Task
> Components: kafka-connect, tests-ci
> Reporter: Ethan Guo
> Assignee: Raymond Xu
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.13.0
>
> The integration test should use bundle jar and run in docker setup. This can
> prevent any issue in the bundle, like HUDI-3903, that is not covered by unit
> and functional tests.
[GitHub] [hudi] yihua commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default
yihua commented on PR #6196: URL: https://github.com/apache/hudi/pull/6196#issuecomment-1224476707

@alexeykudinkin could you also review this PR?
[jira] [Updated] (HUDI-4212) kafka-connect module: Unresolved dependency: 'jdk.tools:jdk.tools:jar:1.7'
[ https://issues.apache.org/jira/browse/HUDI-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-4212:
--
    Sprint: 2022/08/08, 2022/09/05 (was: 2022/08/08, 2022/08/22)

> kafka-connect module: Unresolved dependency: 'jdk.tools:jdk.tools:jar:1.7'
> --
>
> Key: HUDI-4212
> URL: https://issues.apache.org/jira/browse/HUDI-4212
> Project: Apache Hudi
> Issue Type: Improvement
> Components: dependencies, dev-experience, kafka-connect
> Reporter: Raymond Xu
> Assignee: Raymond Xu
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.12.1
>
> Project import first time and IDE complains about Unresolved dependency:
> 'jdk.tools:jdk.tools:jar:1.7' for kafka-connect module.
[GitHub] [hudi] rmahindra123 commented on issue #6348: [SUPPORT] Hudi error while running HoodieMultiTableDeltaStreamer: Commit 20220809112130103 failed and rolled-back !
rmahindra123 commented on issue #6348: URL: https://github.com/apache/hudi/issues/6348#issuecomment-1224406489

For Multitable Deltastreamer, it runs the ingestion sequentially, so it will first ingest table1 and then table2. Let me know if you are still facing issues.
[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource
hudi-bot commented on PR #6135: URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224389948

## CI report:

* d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN
* 14115a6f79de39f538ddfba407f84249c35ebca5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10881)
* 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10905)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6456: [HUDI-4674]Change the default value of inputFormat for the MOR table
alexeykudinkin commented on code in PR #6456: URL: https://github.com/apache/hudi/pull/6456#discussion_r952916053

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/command/CreateHoodieTableCommand.scala: ##

@@ -120,10 +119,8 @@ object CreateHoodieTableCommand {
     val tableType = tableConfig.getTableType.name()
     val inputFormat = tableType match {
-      case DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL =>
+      case DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL | DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL =>

Review Comment:
   @linfey90 I don't think this change makes sense to me. Can you please elaborate on what you're trying to achieve here?
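To make the effect of the diff above concrete, here is a hedged, illustrative sketch (plain Python, purely for exposition, not the actual Scala) of the table-type-to-InputFormat mapping before and after the proposed change. The two input-format class names are Hudi's real Hadoop input formats; the function and flag names are stand-ins, and the "before" behavior for MOR is inferred from the removed `case` in the diff.

```python
# Illustrative model of the Scala `match` in CreateHoodieTableCommand.
# Assumption: before the PR, MOR mapped to the realtime (snapshot) input
# format; the PR collapses COW and MOR into the same Parquet format.
COW = "COPY_ON_WRITE"
MOR = "MERGE_ON_READ"

PARQUET_FMT = "org.apache.hudi.hadoop.HoodieParquetInputFormat"
REALTIME_FMT = "org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat"

def input_format(table_type: str, after_patch: bool) -> str:
    if table_type == COW:
        return PARQUET_FMT
    if table_type == MOR:
        # This is exactly the line the reviewer is questioning: after the
        # patch, MOR tables default to the read-optimized Parquet format.
        return PARQUET_FMT if after_patch else REALTIME_FMT
    raise ValueError(f"unknown table type: {table_type}")

print(input_format(MOR, after_patch=False))
print(input_format(MOR, after_patch=True))
```

The reviewer's concern follows directly from the sketch: with the change, a MOR table registered this way would lose its realtime input format and be read as read-optimized by default.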
[GitHub] [hudi] hudi-bot commented on pull request #6135: [HUDI-4418] Add support for ProtoKafkaSource
hudi-bot commented on PR #6135: URL: https://github.com/apache/hudi/pull/6135#issuecomment-1224382862

## CI report:

* d36fed637603d9959e8d049ac0815b9c729eb246 UNKNOWN
* 14115a6f79de39f538ddfba407f84249c35ebca5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10881)
* 1879403e5a33bfcaa6d9d1d3e6e2cbc226403f90 UNKNOWN

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] alexeykudinkin closed pull request #6193: [WIP] Fixing logging dependencies and configs
alexeykudinkin closed pull request #6193: [WIP] Fixing logging dependencies and configs URL: https://github.com/apache/hudi/pull/6193
[GitHub] [hudi] alexeykudinkin commented on pull request #6193: [WIP] Fixing logging dependencies and configs
alexeykudinkin commented on PR #6193: URL: https://github.com/apache/hudi/pull/6193#issuecomment-1224359433

Yeah, this could be closed
[GitHub] [hudi] yihua commented on pull request #6193: [WIP] Fixing logging dependencies and configs
yihua commented on PR #6193: URL: https://github.com/apache/hudi/pull/6193#issuecomment-1224348069

@alexeykudinkin Is this still needed or replaced by #6170?
[jira] [Updated] (HUDI-4586) Address S3 timeouts in Bloom Index with metadata table
[ https://issues.apache.org/jira/browse/HUDI-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4586:
--
    Story Points: 1 (was: 5)

> Address S3 timeouts in Bloom Index with metadata table
> --
>
> Key: HUDI-4586
> URL: https://issues.apache.org/jira/browse/HUDI-4586
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.12.1
>
> Attachments: Screen Shot 2022-08-15 at 17.39.01.png
>
> For partitioned table, there are significant number of S3 requests timeout
> causing the upserts to fail when using Bloom Index with metadata table.
> {code:java}
> Load meta index key ranges for file slices: hudi
> collect at HoodieSparkEngineContext.java:137+details
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
> org.apache.hudi.index.bloom.HoodieBloomIndex.loadColumnRangesFromMetaIndex(HoodieBloomIndex.java:213)
> org.apache.hudi.index.bloom.HoodieBloomIndex.getBloomIndexFileInfoForPartitions(HoodieBloomIndex.java:145)
> org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:123)
> org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:89)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:49)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:32)
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:329)
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
> {code}
> {code:java}
> org.apache.hudi.exception.HoodieException: Exception when reading log file
> at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:352)
> at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:196)
> at org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.getRecordsByKeys(HoodieMetadataMergedLogRecordReader.java:124)
> at org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:266)
> at org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:222)
> at java.util.HashMap.forEach(HashMap.java:1290)
> at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:209)
> at org.apache.hudi.metadata.BaseTableMetadata.getColumnStats(BaseTableMetadata.java:253)
> at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadColumnRangesFromMetaIndex$cc8e7ca2$1(HoodieBloomIndex.java:224)
> at org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
> at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
> at scala.collection.Iterator.foreach(Iterator.scala:943)
> at scala.collection.Iterator.foreach$(Iterator.scala:943)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
> at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
> at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
> at
[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4635:
--
    Story Points: 0.5 (was: 1)

> Update roadmap page based on H2 2022 plan
> --
>
> Key: HUDI-4635
> URL: https://issues.apache.org/jira/browse/HUDI-4635
> Project: Apache Hudi
> Issue Type: Improvement
> Components: docs
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.12.1
>
[jira] [Updated] (HUDI-3636) Clustering fails due to marker creation failure
[ https://issues.apache.org/jira/browse/HUDI-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-3636:
--
    Story Points: 2 (was: 4)

> Clustering fails due to marker creation failure
> --
>
> Key: HUDI-3636
> URL: https://issues.apache.org/jira/browse/HUDI-3636
> Project: Apache Hudi
> Issue Type: Bug
> Components: multi-writer
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.12.1
>
> Scenario: multi-writer test, one writer doing ingesting with Deltastreamer
> continuous mode, COW, inserts, async clustering and cleaning (partitions
> under 2022/1, 2022/2), another writer with Spark datasource doing backfills
> to different partitions (2021/12).
> 0.10.0 no MT, clustering instant is inflight (failing it in the middle before
> upgrade) ➝ 0.11 MT, with multi-writer configuration the same as before.
> The clustering/replace instant cannot make progress due to marker creation
> failure, failing the DS ingestion as well. Need to investigate if this is
> timeline-server-based marker related or MT related.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in
> stage 46.0 failed 1 times, most recent failure: Lost task 2.0 in stage 46.0
> (TID 277) (192.168.70.231 executor driver): java.lang.RuntimeException:
> org.apache.hudi.exception.HoodieException:
> org.apache.hudi.exception.HoodieException:
> java.util.concurrent.ExecutionException:
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1]
> failed: Connection refused (Connection refused)
> at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
> at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
> at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> at scala.collection.Iterator.foreach(Iterator.scala:943)
> at scala.collection.Iterator.foreach$(Iterator.scala:943)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
> at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
> at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
> at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
> at scala.collection.AbstractIterator.to(Iterator.scala:1431)
> at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
> at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
> at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
> at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
> at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
> at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hudi.exception.HoodieException:
> org.apache.hudi.exception.HoodieException:
> java.util.concurrent.ExecutionException:
> org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file
> 2022/1/24/aa2f24d3-882f-4d48-b20e-9fcd3540c7a7-0_2-46-277_20220314101326706.parquet.marker.CREATE
> Connect to localhost:26754 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1]
> failed: Connection refused (Connection refused)
> at org.apache.hudi.execution.SparkLazyInsertIterable.computeNext(SparkLazyInsertIterable.java:94)
> at
[jira] [Updated] (HUDI-4585) Optimize query performance on Presto Hudi connector
[ https://issues.apache.org/jira/browse/HUDI-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4585:
--
    Story Points: 0 (was: 2)

> Optimize query performance on Presto Hudi connector
> --
>
> Key: HUDI-4585
> URL: https://issues.apache.org/jira/browse/HUDI-4585
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.12.1
>
[jira] [Updated] (HUDI-2955) Upgrade Hadoop to 3.3.x
[ https://issues.apache.org/jira/browse/HUDI-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2955:
--
    Sprint: Hudi-Sprint-Feb-14, Hudi-Sprint-Mar-14, Hudi-Sprint-Mar-21, Hudi-Sprint-Mar-22, Hudi-Sprint-Apr-05, Hudi-Sprint-Apr-19, Hudi-Sprint-Apr-25, 2022/05/02, 2022/05/16, 2022/05/31 (was: Hudi-Sprint-Feb-14, Hudi-Sprint-Mar-14, Hudi-Sprint-Mar-21, Hudi-Sprint-Mar-22, Hudi-Sprint-Apr-05, Hudi-Sprint-Apr-19, Hudi-Sprint-Apr-25, 2022/05/02, 2022/05/16, 2022/05/31, 2022/08/22)

> Upgrade Hadoop to 3.3.x
> --
>
> Key: HUDI-2955
> URL: https://issues.apache.org/jira/browse/HUDI-2955
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Alexey Kudinkin
> Assignee: Rahil Chertara
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2021-12-07 at 2.32.51 PM.png
>
> According to Hadoop compatibility matrix, this is a pre-requisite to
> upgrading to JDK11:
> !Screen Shot 2021-12-07 at 2.32.51 PM.png|width=938,height=230!
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions]
>
> *Upgrading Hadoop from 2.x to 3.x*
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts]
> Everything (relevant to us) seems to be in a good shape, except Spark 2.2/.3
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6352: [HUDI-4584] Fixing `SQLConf` not being propagated to executor
alexeykudinkin commented on code in PR #6352: URL: https://github.com/apache/hudi/pull/6352#discussion_r952852924

## hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/execution/SQLConfInjectingRDD.scala: ##

@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.{Partition, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.internal.SQLConf
+
+import scala.reflect.ClassTag
+
+/**
+ * NOTE: This is a generalized version of Spark's [[SQLExecutionRDD]]
+ *
+ * It is just a wrapper over [[sqlRDD]] which sets and makes effective all the configs from the
+ * captured [[SQLConf]]
+ *
+ * @param sqlRDD the `RDD` generated by the SQL plan
+ * @param conf the `SQLConf` to apply to the execution of the SQL plan
+ */
+class SQLConfInjectingRDD[T: ClassTag](var sqlRDD: RDD[T], @transient conf: SQLConf) extends RDD[T](sqlRDD) {
+  private val sqlConfigs = conf.getAllConfs
+  private lazy val sqlConfExecutorSide = {
+    val newConf = new SQLConf()
+    sqlConfigs.foreach { case (k, v) => newConf.setConfString(k, v) }
+    newConf
+  }
+
+  override val partitioner = firstParent[InternalRow].partitioner
+
+  override def getPartitions: Array[Partition] = firstParent[InternalRow].partitions
+
+  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
+    // If we are in the context of a tracked SQL operation, `SQLExecution.EXECUTION_ID_KEY` is set
+    // and we have nothing to do here. Otherwise, we use the `SQLConf` captured at the creation of
+    // this RDD.
+    if (context.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) == null) {
+      SQLConf.withExistingConf(sqlConfExecutorSide) {

Review Comment:
   Not sure I understood your question: do you mean whether we're wrapping any other chained RDD, or whether the SQLConf will get propagated to every other chained RDD?
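The review above concerns how a driver-side `SQLConf` is re-applied on executors. As a language-neutral illustration of that pattern (plain Python; the names `with_existing_conf`, `captured_confs`, and `compute` are hypothetical stand-ins, not Spark or Hudi API): snapshot the config when the RDD is built, then, when computing a partition, install the snapshot only if no tracked SQL execution has already made a conf effective.

```python
import contextvars

# Hypothetical stand-in for the thread-local SQLConf registry.
_active_conf: contextvars.ContextVar = contextvars.ContextVar("active_conf", default={})

class with_existing_conf:
    """Mimics SQLConf.withExistingConf: installs a captured config for a block."""
    def __init__(self, conf: dict):
        self._conf = conf
    def __enter__(self):
        self._token = _active_conf.set(self._conf)
    def __exit__(self, *exc):
        _active_conf.reset(self._token)

# "Driver side": snapshot the session configs at RDD-creation time.
captured_confs = {"spark.sql.session.timeZone": "UTC"}

def compute(partition, execution_id=None):
    # "Executor side": if a tracked SQL execution is in progress
    # (execution_id set), the conf is assumed to be in place already;
    # otherwise inject the captured snapshot for the duration of compute.
    if execution_id is None:
        with with_existing_conf(captured_confs):
            return _active_conf.get().get("spark.sql.session.timeZone")
    return _active_conf.get().get("spark.sql.session.timeZone")

print(compute(partition=0))  # the captured "UTC" is visible during compute
```

This also hints at the answer to the reviewer's question: only the wrapped computation sees the injected conf, so propagation to further chained RDDs depends on where the wrapper sits in the chain.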
[GitHub] [hudi] pomaster commented on issue #6344: [SUPPORT] spark-sql schema_evolution
pomaster commented on issue #6344: URL: https://github.com/apache/hudi/issues/6344#issuecomment-1224294973

@nsivabalan Looked like @KnightChess has updated the doc already. Thanks @KnightChess.
[jira] [Updated] (HUDI-4659) Develop a validation tool for bootstrap table
[ https://issues.apache.org/jira/browse/HUDI-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4659:
--
    Sprint: 2022/09/05 (was: 2022/08/22)

> Develop a validation tool for bootstrap table
> --
>
> Key: HUDI-4659
> URL: https://issues.apache.org/jira/browse/HUDI-4659
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.13.0
>
[jira] [Updated] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit
[ https://issues.apache.org/jira/browse/HUDI-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-1369:
--
    Sprint: 2022/09/05 (was: 2022/08/22)

> Bootstrap datasource jobs from hanging via spark-submit
> --
>
> Key: HUDI-1369
> URL: https://issues.apache.org/jira/browse/HUDI-1369
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Wenning Ding
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.13.0
>
> MOR table creation via Hudi datasource hangs at the end of the spark-submit
> job.
> Looks like {{HoodieWriteClient}} at
> [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255]
> not being closed which does not stop the timeline server at the end, and as
> a result the job hangs and never exits.
[jira] [Updated] (HUDI-4125) Add IT (Azure CI) around bootstrapped Hudi table
[ https://issues.apache.org/jira/browse/HUDI-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4125:
--
    Sprint: 2022/09/05 (was: 2022/08/22)

> Add IT (Azure CI) around bootstrapped Hudi table
> --
>
> Key: HUDI-4125
> URL: https://issues.apache.org/jira/browse/HUDI-4125
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.13.0
>
> For bootstrapped Hudi table with bootstrap format, the table can be queried
> through different engines without any issue.
[GitHub] [hudi] hudi-bot commented on pull request #6481: [HUDI-4698] Rename the package 'org.apache.flink.table.data' to avoid…
hudi-bot commented on PR #6481: URL: https://github.com/apache/hudi/pull/6481#issuecomment-1224281869

## CI report:

* 3eb012affd4283f9970445bf3dbf4cb48afc25bf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10903)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6480: [HUDI-4687] add show_invalid_parquet procedure
hudi-bot commented on PR #6480: URL: https://github.com/apache/hudi/pull/6480#issuecomment-1224281819

## CI report:

* 9d161840463bb97d4872ce8a2c376cb9e0d00440 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10904)

Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] rmahindra123 commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3
rmahindra123 commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1224268630

Confirmed that #6352 resolves the issue after adding the following config: `--conf spark.sql.avro.datetimeRebaseModeInWrite=LEGACY`
[jira] [Updated] (HUDI-4585) Optimize query performance on Presto Hudi connector
[ https://issues.apache.org/jira/browse/HUDI-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4585:
--
    Story Points: 2 (was: 10)

> Optimize query performance on Presto Hudi connector
> --
>
> Key: HUDI-4585
> URL: https://issues.apache.org/jira/browse/HUDI-4585
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.12.1
>
[GitHub] [hudi] rmahindra123 commented on issue #6278: [SUPPORT] Deltastreamer fails with data and timestamp related exception after upgrading to EMR 6.5 and spark3
rmahindra123 commented on issue #6278: URL: https://github.com/apache/hudi/issues/6278#issuecomment-1224263388

Was able to reproduce by adding the following line in my source:

newDataSet = newDataSet.withColumn("invalidDates", functions.lit("1000-01-11").cast(DataTypes.DateType));

Full stacktrace here: https://gist.github.com/rmahindra123/4ab3614ef6ce30ee2c72499f2633de57
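For context on why a date like `1000-01-11` trips the rebase check: Spark 3 stores dates in the proleptic Gregorian calendar, while Spark 2 and older Hive/Avro writers used the hybrid Julian-Gregorian calendar, and the two disagree for dates before the 1582 switchover. A small self-contained sketch (not Spark code; the function name is a stand-in) of the divergence:

```python
def julian_gregorian_gap(year: int) -> int:
    """Days by which the proleptic Gregorian calendar runs ahead of the
    Julian calendar, valid for dates after February of the given century
    year: total centuries minus the leap centuries Gregorian retains."""
    century = year // 100
    return century - century // 4 - 2

# At the 1582 switchover the calendars differed by the famous 10 days
# (Thursday Oct 4, 1582 Julian was followed by Friday Oct 15, 1582 Gregorian).
print(julian_gregorian_gap(1582))  # 10

# Around the year 1000 the calendars already disagree by several days, which
# is why a literal like "1000-01-11" denotes different physical days under the
# two calendar systems unless a rebase mode (e.g. LEGACY) is set explicitly.
print(julian_gregorian_gap(1000))  # 6
```

Setting `spark.sql.avro.datetimeRebaseModeInWrite=LEGACY`, as confirmed in the comment above, tells Spark to rebase such ancient dates to the hybrid calendar on write instead of failing.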
[jira] [Updated] (HUDI-4468) Simplify TimeTravel logic for Spark 3.3
[ https://issues.apache.org/jira/browse/HUDI-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4468:
--
    Sprint: 2022/09/19 (was: 2022/08/22)

> Simplify TimeTravel logic for Spark 3.3
> --
>
> Key: HUDI-4468
> URL: https://issues.apache.org/jira/browse/HUDI-4468
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Shawn Chang
> Assignee: Alexey Kudinkin
> Priority: Major
> Fix For: 0.12.1
>
> Existing Hudi relies on .g4 files and antlr classes to make time travel work
> for Spark 3.2
> As time travel is supported on Spark 3.3, that logic can be greatly
> simplified and some of it can also be removed
[jira] [Updated] (HUDI-4467) Port borrowed code from Spark 3.3
[ https://issues.apache.org/jira/browse/HUDI-4467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4467:
-----------------------------
    Sprint: 2022/09/19  (was: 2022/08/22)

> Port borrowed code from Spark 3.3
> ---------------------------------
>
>         Key: HUDI-4467
>         URL: https://issues.apache.org/jira/browse/HUDI-4467
>     Project: Apache Hudi
>  Issue Type: Improvement
>    Reporter: Shawn Chang
>    Assignee: Alexey Kudinkin
>    Priority: Major
>     Fix For: 0.12.1
>
> Currently some classes are copied over from the Spark 3.2 module with only the necessary changes. We should port them from Spark 3.3 so that they use the latest implementation in Spark.
>
> Classes copied:
> Spark33NestedSchemaPruning
[jira] [Updated] (HUDI-4465) Optimizing file-listing path in MT
[ https://issues.apache.org/jira/browse/HUDI-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4465:
----------------------------------
    Story Points: 2  (was: 4)

> Optimizing file-listing path in MT
> ----------------------------------
>
>         Key: HUDI-4465
>         URL: https://issues.apache.org/jira/browse/HUDI-4465
>     Project: Apache Hudi
>  Issue Type: Bug
>    Reporter: Alexey Kudinkin
>    Assignee: Alexey Kudinkin
>    Priority: Blocker
>      Labels: pull-request-available
>     Fix For: 0.13.0
>
> We should review the file-listing path and optimize it as much as possible.
[jira] [Updated] (HUDI-4588) Ingestion failing if source column is dropped
[ https://issues.apache.org/jira/browse/HUDI-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4588:
-----------------------------
    Story Points: 4  (was: 12)

> Ingestion failing if source column is dropped
> ---------------------------------------------
>
>         Key: HUDI-4588
>         URL: https://issues.apache.org/jira/browse/HUDI-4588
>     Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>    Reporter: Vamshi Gudavarthi
>    Assignee: Alexey Kudinkin
>    Priority: Blocker
>      Labels: pull-request-available, schema, schema-evolution
>     Fix For: 0.12.1
>
> Attachments: schema_stage1.avsc, schema_stage2.avsc, stage_1.json, stage_2.json
>
> Ingestion using Deltastreamer fails if columns are dropped from the source. I reproduced this with the docker-demo setup. Steps to reproduce:
> # Created the data file `stage_1.json` (attached), ingested it to Kafka, and ingested it into a Hudi table from Kafka using a Deltastreamer job (using FileschemaProvider with `schema_stage1.avsc`).
> # Simulated dropping a column from the source in the next step.
> # Repeated step 1 with the stage 2 files. The stage 2 files don't have the `day` column, and the ingestion job failed. Detailed stacktrace below.
> {code:java}
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at scala.Option.foreach(Option.scala:257)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>     at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
>     at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply$mcD$sp(DoubleRDDFunctions.scala:35)
>     at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:34)
>     at org.apache.spark.api.java.JavaDoubleRDD.sum(JavaDoubleRDD.scala:165)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:607)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:335)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:201)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:199)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:557)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
> {code}
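The general remedy when a column disappears from the source is to project incoming records onto the table's (reader) schema, filling dropped fields from a default instead of failing. A toy, dict-based sketch of that projection (illustrative only, not Hudi's actual schema-resolution code):

```python
def project(record, target_fields, defaults=None):
    """Project a record onto target_fields, filling missing fields from defaults (else None)."""
    defaults = defaults or {}
    return {f: record.get(f, defaults.get(f)) for f in target_fields}

table_fields = ["id", "ts", "day"]   # stage 1 schema: still has 'day'
stage2_row = {"id": 2, "ts": 200}    # stage 2 data: 'day' was dropped at the source

projected = project(stage2_row, table_fields, defaults={"day": None})
print(projected)  # {'id': 2, 'ts': 200, 'day': None}
```

This mirrors Avro's reader/writer schema resolution: a field present in the reader schema but absent from the writer's data is only resolvable when the reader schema declares a default for it.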
[jira] [Updated] (HUDI-4691) Deduplicate Spark 3.2 and Spark 3.3 integrations
[ https://issues.apache.org/jira/browse/HUDI-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4691:
-----------------------------
    Story Points: 6  (was: 12)

> Deduplicate Spark 3.2 and Spark 3.3 integrations
> ------------------------------------------------
>
>              Key: HUDI-4691
>              URL: https://issues.apache.org/jira/browse/HUDI-4691
>          Project: Apache Hudi
>       Issue Type: Bug
>       Components: reader-core, writer-core
> Affects Versions: 0.12.0
>         Reporter: Alexey Kudinkin
>         Assignee: Alexey Kudinkin
>         Priority: Blocker
>          Fix For: 0.13.0
>
> While adding support for Spark 3.3, a considerable portion of the version-specific integration was simply copied over from the Spark 3.2 one, without deliberating whether this was required. We should address this duplication ASAP, so that only the pieces necessary to handle version-specific behavior are duplicated.
[jira] [Updated] (HUDI-4690) Remove code duplicated over from Spark
[ https://issues.apache.org/jira/browse/HUDI-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4690:
----------------------------------
    Story Points: 12  (was: 5)

> Remove code duplicated over from Spark
> --------------------------------------
>
>              Key: HUDI-4690
>              URL: https://issues.apache.org/jira/browse/HUDI-4690
>          Project: Apache Hudi
>       Issue Type: Bug
>       Components: reader-core, writer-core
> Affects Versions: 0.12.0
>         Reporter: Alexey Kudinkin
>         Assignee: Alexey Kudinkin
>         Priority: Blocker
>           Labels: pull-request-available
>          Fix For: 0.13.0
>
> At present, a lot of code in `HoodieAnalysis` unnecessarily duplicates the resolution logic from Spark, which interferes with the normal operation of Spark's Analyzer and leads to non-trivial issues (like HUDI-4503) when dealing with Spark or Spark SQL.
>
> We should minimize the Spark logic and code localized into Hudi to what is strictly necessary to either:
> # Address issues (as an alternative to upstreaming fixes into Spark)
> # Back-port features (from newer Spark versions to older ones)
[jira] [Updated] (HUDI-2955) Upgrade Hadoop to 3.3.x
[ https://issues.apache.org/jira/browse/HUDI-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-2955:
-----------------------------
    Reviewers: Ethan Guo  (was: Alexey Kudinkin, Ethan Guo)

> Upgrade Hadoop to 3.3.x
> -----------------------
>
>         Key: HUDI-2955
>         URL: https://issues.apache.org/jira/browse/HUDI-2955
>     Project: Apache Hudi
>  Issue Type: Improvement
>    Reporter: Alexey Kudinkin
>    Assignee: Rahil Chertara
>    Priority: Blocker
>      Labels: pull-request-available
>     Fix For: 0.13.0
>
> Attachments: Screen Shot 2021-12-07 at 2.32.51 PM.png
>
> According to the Hadoop compatibility matrix, this is a prerequisite for upgrading to JDK 11:
> !Screen Shot 2021-12-07 at 2.32.51 PM.png|width=938,height=230!
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions]
>
> *Upgrading Hadoop from 2.x to 3.x*
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.x+to+3.x+Upgrade+Efforts]
> Everything (relevant to us) seems to be in good shape, except Spark 2.2/2.3.
[jira] [Updated] (HUDI-4584) SQLConf is not propagated correctly into RDDs
[ https://issues.apache.org/jira/browse/HUDI-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4584:
----------------------------------
    Story Points: 6  (was: 8)

> SQLConf is not propagated correctly into RDDs
> ---------------------------------------------
>
>         Key: HUDI-4584
>         URL: https://issues.apache.org/jira/browse/HUDI-4584
>     Project: Apache Hudi
>  Issue Type: Bug
>    Reporter: Alexey Kudinkin
>    Assignee: Alexey Kudinkin
>    Priority: Blocker
>      Labels: pull-request-available
>     Fix For: 0.12.1
>
> There have been a few reports, in Slack as well as in GitHub Issues, of Spark SQL configs not being respected by DeltaStreamer while working perfectly fine when leveraging the DataSource API:
> [https://github.com/apache/hudi/issues/6278]
>
> I was able to trace these down to:
> # `HoodieSparkUtils.createRDD` instantiating `AvroSerializer`, which uses a SQLConf that isn't propagated by Spark properly.
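The failure mode, a config held in driver-side thread-local state that never reaches the code running on another thread, can be illustrated without Spark. In this sketch (the config key is just an example of a real Spark SQL setting), the value has to be captured explicitly and passed along, which is essentially what "propagating SQLConf into the RDD code path" amounts to:

```python
import threading

conf = threading.local()
# Set on the "driver" thread, like a session-level Spark SQL config.
conf.settings = {"spark.sql.session.timeZone": "UTC"}

results = {}

def implicit_read():
    # A thread-local assigned on another thread is invisible here.
    results["implicit"] = getattr(conf, "settings", None)

def explicit_read(captured):
    # Snapshotting the config and handing it over works reliably.
    results["explicit"] = captured

t = threading.Thread(target=implicit_read); t.start(); t.join()
t = threading.Thread(target=explicit_read, args=(dict(conf.settings),)); t.start(); t.join()

print(results["implicit"])   # None -- the config was "lost"
print(results["explicit"])   # the captured snapshot survives
```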
[jira] [Updated] (HUDI-4691) Deduplicate Spark 3.2 and Spark 3.3 integrations
[ https://issues.apache.org/jira/browse/HUDI-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4691:
----------------------------------
    Story Points: 12  (was: 3)

> Deduplicate Spark 3.2 and Spark 3.3 integrations
> ------------------------------------------------
>
>              Key: HUDI-4691
>              URL: https://issues.apache.org/jira/browse/HUDI-4691
>          Project: Apache Hudi
>       Issue Type: Bug
>       Components: reader-core, writer-core
> Affects Versions: 0.12.0
>         Reporter: Alexey Kudinkin
>         Assignee: Alexey Kudinkin
>         Priority: Blocker
>          Fix For: 0.13.0
>
> While adding support for Spark 3.3, a considerable portion of the version-specific integration was simply copied over from the Spark 3.2 one, without deliberating whether this was required. We should address this duplication ASAP, so that only the pieces necessary to handle version-specific behavior are duplicated.
[jira] [Updated] (HUDI-4588) Ingestion failing if source column is dropped
[ https://issues.apache.org/jira/browse/HUDI-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4588:
----------------------------------
    Story Points: 12  (was: 5)

> Ingestion failing if source column is dropped
> ---------------------------------------------
>
>         Key: HUDI-4588
>         URL: https://issues.apache.org/jira/browse/HUDI-4588
>     Project: Apache Hudi
>  Issue Type: Bug
>  Components: deltastreamer
>    Reporter: Vamshi Gudavarthi
>    Assignee: Alexey Kudinkin
>    Priority: Blocker
>      Labels: pull-request-available, schema, schema-evolution
>     Fix For: 0.12.1
>
> Attachments: schema_stage1.avsc, schema_stage2.avsc, stage_1.json, stage_2.json
>
> Ingestion using Deltastreamer fails if columns are dropped from the source. I reproduced this with the docker-demo setup. Steps to reproduce:
> # Created the data file `stage_1.json` (attached), ingested it to Kafka, and ingested it into a Hudi table from Kafka using a Deltastreamer job (using FileschemaProvider with `schema_stage1.avsc`).
> # Simulated dropping a column from the source in the next step.
> # Repeated step 1 with the stage 2 files. The stage 2 files don't have the `day` column, and the ingestion job failed. Detailed stacktrace below.
> {code:java}
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>     at scala.Option.foreach(Option.scala:257)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>     at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
>     at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply$mcD$sp(DoubleRDDFunctions.scala:35)
>     at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:35)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>     at org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:34)
>     at org.apache.spark.api.java.JavaDoubleRDD.sum(JavaDoubleRDD.scala:165)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:607)
>     at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:335)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:201)
>     at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:199)
>     at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:557)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
> {code}
[jira] [Updated] (HUDI-4503) Support table identifier with explicit catalog
[ https://issues.apache.org/jira/browse/HUDI-4503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4503:
----------------------------------
    Story Points: 4  (was: 2)

> Support table identifier with explicit catalog
> ----------------------------------------------
>
>         Key: HUDI-4503
>         URL: https://issues.apache.org/jira/browse/HUDI-4503
>     Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark, spark-sql
>    Reporter: Yann Byron
>    Assignee: Alexey Kudinkin
>    Priority: Blocker
>      Labels: pull-request-available
>     Fix For: 0.13.0
[jira] [Updated] (HUDI-4626) Partitioning table by `_hoodie_partition_path` fails
[ https://issues.apache.org/jira/browse/HUDI-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4626:
----------------------------------
    Story Points: 4  (was: 2)

> Partitioning table by `_hoodie_partition_path` fails
> ----------------------------------------------------
>
>              Key: HUDI-4626
>              URL: https://issues.apache.org/jira/browse/HUDI-4626
>          Project: Apache Hudi
>       Issue Type: Bug
> Affects Versions: 0.12.0
>         Reporter: Alexey Kudinkin
>         Assignee: Alexey Kudinkin
>         Priority: Blocker
>          Fix For: 0.12.1
>
> Currently, creating a table partitioned by "_hoodie_partition_path" using the Glue catalog fails with the following exception:
> {code:java}
> AnalysisException: Found duplicate column(s) in the data schema and the partition schema: _hoodie_partition_path
> {code}
> Using the following DDL:
> {code:java}
> CREATE EXTERNAL TABLE `active_storage_attachments`(
>   `_hoodie_commit_time` string COMMENT '',
>   `_hoodie_commit_seqno` string COMMENT '',
>   `_hoodie_record_key` string COMMENT '',
>   `_hoodie_file_name` string COMMENT '',
>   `_change_operation_type` string COMMENT '',
>   `_upstream_event_processed_ts_ms` bigint COMMENT '',
>   `db_shard_source_partition` string COMMENT '',
>   `_event_origin_ts_ms` bigint COMMENT '',
>   `_event_tx_id` bigint COMMENT '',
>   `_event_lsn` bigint COMMENT '',
>   `_event_xmin` bigint COMMENT '',
>   `id` bigint COMMENT '',
>   `name` string COMMENT '',
>   `record_type` string COMMENT '',
>   `record_id` bigint COMMENT '',
>   `blob_id` bigint COMMENT '',
>   `created_at` timestamp COMMENT '')
> PARTITIONED BY (`_hoodie_partition_path` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('hoodie.query.as.ro.table'='false', 'path'='...')
> STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '...'
> TBLPROPERTIES ('spark.sql.sources.provider'='hudi')
> {code}
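The AnalysisException above is a generic guard against a column appearing in both the data schema and the partition schema. A toy sketch of that validation (illustrative only, not Spark's actual implementation):

```python
def validate_partitioning(data_columns, partition_columns):
    """Reject partition columns that already exist in the data schema."""
    dups = sorted(set(data_columns) & set(partition_columns))
    if dups:
        raise ValueError(
            "Found duplicate column(s) in the data schema and the "
            "partition schema: " + ", ".join(dups)
        )
    return list(data_columns) + list(partition_columns)

# Hudi persists `_hoodie_partition_path` inside the data files, so it is
# always part of the data schema -- declaring it as a partition column
# trips the duplicate-column check.
data_cols = ["_hoodie_commit_time", "_hoodie_partition_path", "id", "name"]
try:
    validate_partitioning(data_cols, ["_hoodie_partition_path"])
except ValueError as e:
    print(e)  # mirrors the AnalysisException message
```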
[jira] [Updated] (HUDI-4584) SQLConf is not propagated correctly into RDDs
[ https://issues.apache.org/jira/browse/HUDI-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4584:
----------------------------------
    Story Points: 8  (was: 4)

> SQLConf is not propagated correctly into RDDs
> ---------------------------------------------
>
>         Key: HUDI-4584
>         URL: https://issues.apache.org/jira/browse/HUDI-4584
>     Project: Apache Hudi
>  Issue Type: Bug
>    Reporter: Alexey Kudinkin
>    Assignee: Alexey Kudinkin
>    Priority: Blocker
>      Labels: pull-request-available
>     Fix For: 0.12.1
>
> There have been a few reports, in Slack as well as in GitHub Issues, of Spark SQL configs not being respected by DeltaStreamer while working perfectly fine when leveraging the DataSource API:
> [https://github.com/apache/hudi/issues/6278]
>
> I was able to trace these down to:
> # `HoodieSparkUtils.createRDD` instantiating `AvroSerializer`, which uses a SQLConf that isn't propagated by Spark properly.
[jira] [Updated] (HUDI-4364) integrate column stats index with presto engine
[ https://issues.apache.org/jira/browse/HUDI-4364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4364:
-----------------------------
    Sprint:   (was: 2022/08/22)

> integrate column stats index with presto engine
> -----------------------------------------------
>
>         Key: HUDI-4364
>         URL: https://issues.apache.org/jira/browse/HUDI-4364
>     Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metadata, reader-core
>    Reporter: Pratyaksh Sharma
>    Assignee: Pratyaksh Sharma
>    Priority: Major
>      Labels: pull-request-available
[jira] [Updated] (HUDI-3397) Make sure Spark RDDs triggering actual FS activity are only dereferenced once
[ https://issues.apache.org/jira/browse/HUDI-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-3397:
----------------------------------
    Sprint: 2022/09/05

> Make sure Spark RDDs triggering actual FS activity are only dereferenced once
> -----------------------------------------------------------------------------
>
>         Key: HUDI-3397
>         URL: https://issues.apache.org/jira/browse/HUDI-3397
>     Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>    Reporter: Alexey Kudinkin
>    Assignee: Alexey Kudinkin
>    Priority: Blocker
>      Labels: spark
>     Fix For: 0.13.0
>
> Currently, the RDD `collect()` operation is treated quite loosely, and there are multiple flows that dereference RDDs (e.g., through `collect`, `count`, etc.) in a way that triggers the same operations multiple times, occasionally duplicating output already persisted on the FS. See HUDI-3370 for a recent example.
> NOTE: even though Spark caching is supposed to ensure that we aren't writing to the FS multiple times, we can't rely solely on caching to guarantee exactly-once execution.
> Instead, we should make sure that RDDs are dereferenced only *once*, within the "commit" operation, and that all other operations rely only on _derivative_ data.
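The exactly-once concern can be sketched with a stand-in for an RDD whose action has a filesystem side effect: each extra dereference repeats the side effect, while materializing once and deriving everything else from the result does not. (`FakeRDD` is a toy for illustration, not a Spark or Hudi class.)

```python
writes = []

class FakeRDD:
    """Toy RDD: every action re-runs its (side-effecting) computation."""
    def __init__(self, compute):
        self.compute = compute
    def collect(self):
        return self.compute()

def write_files():
    writes.append("commit-files")   # pretend this persists files on the FS
    return [1.0, 2.0, 3.0]

rdd = FakeRDD(write_files)

# Anti-pattern: two dereferences -> the FS write happens twice.
rdd.collect()
total = sum(rdd.collect())
assert len(writes) == 2

# Fix: dereference once, then work only with the materialized result.
writes.clear()
rows = rdd.collect()
total = sum(rows)
count = len(rows)
assert len(writes) == 1
```

Relying on `cache()`/`persist()` alone does not give this guarantee, since cached partitions can be evicted and recomputed; only dereferencing once and passing derivative data around does.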
[jira] [Updated] (HUDI-4465) Optimizing file-listing path in MT
[ https://issues.apache.org/jira/browse/HUDI-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Kudinkin updated HUDI-4465:
----------------------------------
    Sprint: 2022/08/22

> Optimizing file-listing path in MT
> ----------------------------------
>
>         Key: HUDI-4465
>         URL: https://issues.apache.org/jira/browse/HUDI-4465
>     Project: Apache Hudi
>  Issue Type: Bug
>    Reporter: Alexey Kudinkin
>    Assignee: Alexey Kudinkin
>    Priority: Blocker
>      Labels: pull-request-available
>     Fix For: 0.13.0
>
> We should review the file-listing path and optimize it as much as possible.
[jira] [Updated] (HUDI-4467) Port borrowed code from Spark 3.3
[ https://issues.apache.org/jira/browse/HUDI-4467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4467:
-----------------------------
    Story Points: 5

> Port borrowed code from Spark 3.3
> ---------------------------------
>
>         Key: HUDI-4467
>         URL: https://issues.apache.org/jira/browse/HUDI-4467
>     Project: Apache Hudi
>  Issue Type: Improvement
>    Reporter: Shawn Chang
>    Assignee: Alexey Kudinkin
>    Priority: Major
>     Fix For: 0.12.1
>
> Currently some classes are copied over from the Spark 3.2 module with only the necessary changes. We should port them from Spark 3.3 so that they use the latest implementation in Spark.
>
> Classes copied:
> Spark33NestedSchemaPruning
[jira] [Updated] (HUDI-4468) Simplify TimeTravel logic for Spark 3.3
[ https://issues.apache.org/jira/browse/HUDI-4468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raymond Xu updated HUDI-4468:
-----------------------------
    Sprint: 2022/08/22  (was: 2022/09/19)

> Simplify TimeTravel logic for Spark 3.3
> ---------------------------------------
>
>         Key: HUDI-4468
>         URL: https://issues.apache.org/jira/browse/HUDI-4468
>     Project: Apache Hudi
>  Issue Type: Improvement
>    Reporter: Shawn Chang
>    Assignee: Alexey Kudinkin
>    Priority: Major
>     Fix For: 0.12.1
>
> Existing Hudi relies on .g4 files and antlr classes to make time travel work for Spark 3.2. Since time travel is natively supported in Spark 3.3, that logic can be greatly simplified and parts of it removed entirely.