[jira] [Updated] (HUDI-3881) Implement index syntax for spark sql
[ https://issues.apache.org/jira/browse/HUDI-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Forward Xu updated HUDI-3881:
-
Description:
{code:java}
1. create index
CREATE INDEX [IF NOT EXISTS] index_name
ON TABLE [db_name.]table_name (column_name, ...)
AS bloomfilter/lucene
[WITH DEFERRED REFRESH]
[PROPERTIES ('key'='value')]
{code}

was:
{code:java}
CREATE INDEX [IF NOT EXISTS] index_name
ON TABLE [db_name.]table_name (column_name, ...)
AS bloomfilter/lucene
[WITH DEFERRED REFRESH]
[PROPERTIES ('key'='value')]
{code}

> Implement index syntax for spark sql
>
> Key: HUDI-3881
> URL: https://issues.apache.org/jira/browse/HUDI-3881
> Project: Apache Hudi
> Issue Type: New Feature
> Components: spark-sql
> Reporter: Forward Xu
> Assignee: Forward Xu
> Priority: Major
>
> {code:java}
> 1. create index
> CREATE INDEX [IF NOT EXISTS] index_name
> ON TABLE [db_name.]table_name (column_name, ...)
> AS bloomfilter/lucene
> [WITH DEFERRED REFRESH]
> [PROPERTIES ('key'='value')]
> {code}

-- This message was sent by Atlassian Jira (v8.20.1#820001)
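For illustration, a concrete statement under the proposed syntax might look like the following. The table, column, and property names here are made up for the example and are not part of the proposal; whether any particular property key is honored would depend on the eventual implementation:

{code:sql}
CREATE INDEX IF NOT EXISTS idx_trips_uuid
ON TABLE hudi_db.trips (uuid)
AS bloomfilter
WITH DEFERRED REFRESH
PROPERTIES ('hoodie.index.bloom.num_entries'='60000')
{code}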
[jira] [Updated] (HUDI-3892) Add HoodieReadClient with java
[ https://issues.apache.org/jira/browse/HUDI-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Forward Xu updated HUDI-3892:
-
Description:
We might need a hoodie read client in Java similar to the one we have for Spark.
[Apache Pulsar|https://github.com/apache/pulsar] is integrating with Hudi, using Hudi as tiered storage to offload cold topic data into Hudi. When consumers fetch cold data from a topic, the Pulsar broker determines whether the target data is stored in Pulsar or not. If the target data is stored in tiered storage (Hudi), the Pulsar broker fetches the data from Hudi through a Java API, packages it into the Pulsar format, and dispatches it to the consumer side.
However, we found that the current Hudi implementation doesn't support reading Hudi table records through a Java API, so we couldn't read the target data out of Hudi into the Pulsar broker, which blocks the Pulsar & Hudi integration.
h3. What we need
# Hudi needs to support reading records through a Java API.
# Hudi needs to support reading records out while preserving writer order, or support ordering by specific fields.
> Add HoodieReadClient with java
> --
>
> Key: HUDI-3892
> URL: https://issues.apache.org/jira/browse/HUDI-3892
> Project: Apache Hudi
> Issue Type: Task
> Components: reader-core
> Reporter: sivabalan narayanan
> Priority: Critical
> Fix For: 0.12.0
>
> We might need a hoodie read client in Java similar to the one we have for Spark.
> [Apache Pulsar|https://github.com/apache/pulsar] is integrating with Hudi, using Hudi as tiered storage to offload cold topic data into Hudi. When consumers fetch cold data from a topic, the Pulsar broker determines whether the target data is stored in Pulsar or not. If the target data is stored in tiered storage (Hudi), the Pulsar broker fetches the data from Hudi through a Java API, packages it into the Pulsar format, and dispatches it to the consumer side.
> However, we found that the current Hudi implementation doesn't support reading Hudi table records through a Java API, so we couldn't read the target data out of Hudi into the Pulsar broker, which blocks the Pulsar & Hudi integration.
> h3. What we need
> # Hudi needs to support reading records through a Java API.
> # Hudi needs to support reading records out while preserving writer order, or support ordering by specific fields.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
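Hudi has no such Java read client today; the sketch below is purely hypothetical, showing one possible shape for the surface the issue asks for (reading records via Java, in writer order). Every name is invented, and the in-memory stand-in exists only to make the sketch self-contained and runnable; a real implementation would scan the table's file slices instead.

```java
import java.util.ArrayList;
import java.util.List;

public class ReadClientSketch {
    // Hypothetical record type; a real client would return Hudi/Avro records.
    record HoodieRecord(String key, String payload) {}

    // Hypothetical read-client surface for the two requested capabilities:
    // (1) reading records from a table via Java, (2) preserving writer order.
    interface HoodieJavaReadClient {
        List<HoodieRecord> readInWriterOrder(String tablePath);
    }

    // Toy in-memory stand-in so the sketch compiles and runs on its own;
    // a real implementation would merge log files with base files per file slice.
    static final class InMemoryReadClient implements HoodieJavaReadClient {
        private final List<HoodieRecord> written = new ArrayList<>();

        void write(HoodieRecord r) {
            written.add(r);
        }

        @Override
        public List<HoodieRecord> readInWriterOrder(String tablePath) {
            // Insertion order stands in for writer order in this toy model.
            return List.copyOf(written);
        }
    }
}
```

The point of the sketch is only the API shape: a single call that returns records in a deterministic, writer-defined order, which is requirement #2 in the issue.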
[jira] [Assigned] (HUDI-3877) Support Java reader for hudi
[ https://issues.apache.org/jira/browse/HUDI-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Forward Xu reassigned HUDI-3877: Assignee: (was: Forward Xu) > Support Java reader for hudi > > > Key: HUDI-3877 > URL: https://issues.apache.org/jira/browse/HUDI-3877 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Simon Su >Priority: Major > > From issue: > [https://github.com/apache/hudi/issues/5313] > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3877) Support Java reader for hudi
[ https://issues.apache.org/jira/browse/HUDI-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Forward Xu reassigned HUDI-3877: Assignee: Forward Xu > Support Java reader for hudi > > > Key: HUDI-3877 > URL: https://issues.apache.org/jira/browse/HUDI-3877 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Simon Su >Assignee: Forward Xu >Priority: Major > > From issue: > [https://github.com/apache/hudi/issues/5313] > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3897) Drop scala 2.11 artifacts
[ https://issues.apache.org/jira/browse/HUDI-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3897:
-
Description:
To reduce complexity in the artifacts:
- Use Scala 2.12 for all Spark bundles
- Use the Scala-free Flink version - https://flink.apache.org/2022/02/22/scala-free.html
- Able to get rid of the enforcer plugin - https://github.com/apache/hudi/pull/5297#discussion_r848922414

was:
To reduce complexity in the artifacts:
- Use Scala 2.12 for all Spark bundles
- Use the Scala-free Flink version - https://flink.apache.org/2022/02/22/scala-free.html
- Remove the enforcer plugin - https://github.com/apache/hudi/pull/5297#discussion_r848922414

> Drop scala 2.11 artifacts
> -
>
> Key: HUDI-3897
> URL: https://issues.apache.org/jira/browse/HUDI-3897
> Project: Apache Hudi
> Issue Type: Task
> Components: dependencies
> Reporter: Raymond Xu
> Priority: Major
>
> To reduce complexity in the artifacts:
> - Use Scala 2.12 for all Spark bundles
> - Use the Scala-free Flink version - https://flink.apache.org/2022/02/22/scala-free.html
> - Able to get rid of the enforcer plugin - https://github.com/apache/hudi/pull/5297#discussion_r848922414

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3897) Drop scala 2.11 artifacts
[ https://issues.apache.org/jira/browse/HUDI-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3897:
-
Description:
To reduce complexity in the artifacts:
- Use Scala 2.12 for all Spark bundles
- Use the Scala-free Flink version - https://flink.apache.org/2022/02/22/scala-free.html
- Remove the enforcer plugin - https://github.com/apache/hudi/pull/5297#discussion_r848922414

was: To reduce complexity in the artifacts.

> Drop scala 2.11 artifacts
> -
>
> Key: HUDI-3897
> URL: https://issues.apache.org/jira/browse/HUDI-3897
> Project: Apache Hudi
> Issue Type: Task
> Components: dependencies
> Reporter: Raymond Xu
> Priority: Major
>
> To reduce complexity in the artifacts:
> - Use Scala 2.12 for all Spark bundles
> - Use the Scala-free Flink version - https://flink.apache.org/2022/02/22/scala-free.html
> - Remove the enforcer plugin - https://github.com/apache/hudi/pull/5297#discussion_r848922414

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3897) Drop scala 2.11 artifacts
Raymond Xu created HUDI-3897: Summary: Drop scala 2.11 artifacts Key: HUDI-3897 URL: https://issues.apache.org/jira/browse/HUDI-3897 Project: Apache Hudi Issue Type: Task Components: dependencies Reporter: Raymond Xu To reduce complexity in the artifacts. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] simonsssu commented on issue #5313: [SUPPORT] Do we have plan to support java reader for Hudi?
simonsssu commented on issue #5313: URL: https://github.com/apache/hudi/issues/5313#issuecomment-1100803165 @nsivabalan hi nsivabalan, my jira id is HUDI-3877, thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5338: [WIP][HUDI-3894] Fix datahub and gcp bundles to include HBase dependencies and shading
hudi-bot commented on PR #5338: URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100757479 ## CI report: * 5cd21a598d455492412ea525feddaa53325b5c9a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8087) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523216#comment-17523216 ] Alexey Kudinkin commented on HUDI-3891:
---
So the root cause of this discrepancy in the amount of data read is a pair of issues:
- HUDI-3895: Missing sorting of the file-splits, resulting in invalid bin-packing of them.
- HUDI-3896: Inability to apply the `SchemaPruning` optimization rule, since it relies on `HadoopFsRelation` being used.
As a result, when reading the table as raw Parquet, Spark is able to effectively prune all but the single field requested from the nested struct:
!image-2022-04-16-13-50-43-916.png|width=1446,height=155!
!image-2022-04-16-13-50-43-956.png|width=1823,height=165!

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: image-2022-04-16-13-50-43-916.png, image-2022-04-16-13-50-43-956.png
>
> While benchmarking querying raw Parquet tables against Hudi tables, I've run the test against the same (Hudi) table:
> # In one query path I'm reading it as just a raw Parquet table
> # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those two diverge in the number of files being read:
>
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png|width=1691,height=149!
>
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png|width=1673,height=142!

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3891:
--
Description:
While benchmarking querying raw Parquet tables against Hudi tables, I've run the test against the same (Hudi) table:
# In one query path I'm reading it as just a raw Parquet table
# In another, I'm reading it as a Hudi RO (read_optimized) table
Surprisingly enough, those two diverge in the number of files being read:
_Raw Parquet_
!https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png|width=1691,height=149!
_Hudi_
!https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png|width=1673,height=142!

was:
While benchmarking querying raw Parquet tables against Hudi tables, I've run the test against the same (Hudi) table:
# In one query path I'm reading it as just a raw Parquet table
# In another, I'm reading it as a Hudi RO (read_optimized) table
Surprisingly enough, those two diverge in the number of files being read:
_Raw Parquet_
!https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
_Hudi_
!https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3891:
--
Attachment: image-2022-04-16-13-50-43-916.png

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3891:
--
Attachment: image-2022-04-16-13-50-43-956.png

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3896) Support Spark optimizations for `HadoopFsRelation`
[ https://issues.apache.org/jira/browse/HUDI-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3896:
--
Description:
After migrating to Hudi's own Relation implementations, we unfortunately broke some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
While these optimizations could perfectly well be implemented for any `FileRelation`, Spark unfortunately predicates them on the usage of `HadoopFsRelation`, making them non-applicable to any of Hudi's relations.
The proper long-term solution would be to fix this in Spark, either by:
# Generalizing such optimizations to any `FileRelation`
# Making `HadoopFsRelation` extensible (making it a non-case class)
One example of this is Spark's `SchemaPruning` optimization rule (HUDI-3891): Spark 3.2.x is able to effectively reduce the amount of data read via schema pruning (projecting the read data) even for nested structs; however, this optimization is predicated on the usage of `HadoopFsRelation`:
!Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!

was:
After migrating to Hudi's own Relation implementations, we unfortunately broke some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
While these optimizations could perfectly well be implemented for any `FileRelation`, Spark unfortunately predicates them on the usage of `HadoopFsRelation`, making them non-applicable to any of Hudi's relations.
The proper long-term solution would be to fix this in Spark, either by:
# Generalizing such optimizations to any `FileRelation`
# Making `HadoopFsRelation` extensible (making it a non-case class)

> Support Spark optimizations for `HadoopFsRelation`
> --
>
> Key: HUDI-3896
> URL: https://issues.apache.org/jira/browse/HUDI-3896
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-04-16 at 1.46.50 PM.png
>
> After migrating to Hudi's own Relation implementations, we unfortunately broke some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
> While these optimizations could perfectly well be implemented for any `FileRelation`, Spark unfortunately predicates them on the usage of `HadoopFsRelation`, making them non-applicable to any of Hudi's relations.
> The proper long-term solution would be to fix this in Spark, either by:
> # Generalizing such optimizations to any `FileRelation`
> # Making `HadoopFsRelation` extensible (making it a non-case class)
>
> One example of this is Spark's `SchemaPruning` optimization rule (HUDI-3891): Spark 3.2.x is able to effectively reduce the amount of data read via schema pruning (projecting the read data) even for nested structs; however, this optimization is predicated on the usage of `HadoopFsRelation`:
> !Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!

-- This message was sent by Atlassian Jira (v8.20.1#820001)
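The structural problem described above can be modeled in a few self-contained lines (the class names below are illustrative stand-ins, not Spark's actual types): an optimization that checks for one concrete class via `instanceof` silently skips every other implementation of the same interface, which is exactly what predicating optimizer rules on `HadoopFsRelation` does to Hudi's relations.

```java
public class PredicatedRuleDemo {
    interface FileRelation {
        long bytesToRead();
    }

    // Stand-in for HadoopFsRelation: the one concrete type the rule recognizes.
    static final class BuiltinRelation implements FileRelation {
        @Override
        public long bytesToRead() { return 100L; }
    }

    // Stand-in for a Hudi relation: same capability, different concrete type.
    static final class CustomRelation implements FileRelation {
        @Override
        public long bytesToRead() { return 100L; }
    }

    // The "optimization" fires only for the built-in concrete type, so any
    // other FileRelation reads the full, unpruned amount of data.
    static long bytesReadWithPruning(FileRelation relation) {
        if (relation instanceof BuiltinRelation) {
            return relation.bytesToRead() / 10; // pruning cuts the read
        }
        return relation.bytesToRead(); // rule silently not applied
    }
}
```

Both proposed fixes amount to the same change in this model: dispatch on the `FileRelation` interface (or make the built-in type extensible) so the rule fires for every implementation.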
[jira] [Updated] (HUDI-3896) Support Spark optimizations for `HadoopFsRelation`
[ https://issues.apache.org/jira/browse/HUDI-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3896:
--
Attachment: Screen Shot 2022-04-16 at 1.46.50 PM.png

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3891:
--
Issue Type: Task (was: Bug)

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3896) Support Spark optimizations for `HadoopFsRelation`
Alexey Kudinkin created HUDI-3896:
-
Summary: Support Spark optimizations for `HadoopFsRelation`
Key: HUDI-3896
URL: https://issues.apache.org/jira/browse/HUDI-3896
Project: Apache Hudi
Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
Fix For: 0.12.0

After migrating to Hudi's own Relation implementations, we unfortunately broke some of the optimizations that Spark applies exclusively to `HadoopFsRelation`.
While these optimizations could perfectly well be implemented for any `FileRelation`, Spark unfortunately predicates them on the usage of `HadoopFsRelation`, making them non-applicable to any of Hudi's relations.
The proper long-term solution would be to fix this in Spark, either by:
# Generalizing such optimizations to any `FileRelation`
# Making `HadoopFsRelation` extensible (making it a non-case class)

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)
[ https://issues.apache.org/jira/browse/HUDI-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3895:
--
Status: Patch Available (was: In Progress)

> Make sure Hudi relations do proper file-split packing (on par w/ Spark)
> ---
>
> Key: HUDI-3895
> URL: https://issues.apache.org/jira/browse/HUDI-3895
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.11.0
>
> While investigating HUDI-3891, it was discovered that with the introduction of Hudi's own Spark Relation implementations, the file-split packing algorithm was inadvertently subverted:
> Spark's algorithm does greedy packing, which relies on the list of file-splits being ordered by file size in descending order.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)
[ https://issues.apache.org/jira/browse/HUDI-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3895:
--
Status: In Progress (was: Open)

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)
[ https://issues.apache.org/jira/browse/HUDI-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3895:
--
Sprint: Hudi-Sprint-Apr-12

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3891:
--
Priority: Blocker (was: Critical)

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (HUDI-3738) Perf comparison between parquet and hudi for COW snapshot and MOR read optimized
[ https://issues.apache.org/jira/browse/HUDI-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin closed HUDI-3738. - Resolution: Fixed > Perf comparison between parquet and hudi for COW snapshot and MOR read > optimized > > > Key: HUDI-3738 > URL: https://issues.apache.org/jira/browse/HUDI-3738 > Project: Apache Hudi > Issue Type: Task > Components: performance >Reporter: sivabalan narayanan >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3891:
--
Fix Version/s: 0.11.0

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HUDI-3738) Perf comparison between parquet and hudi for COW snapshot and MOR read optimized
[ https://issues.apache.org/jira/browse/HUDI-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin resolved HUDI-3738. --- > Perf comparison between parquet and hudi for COW snapshot and MOR read > optimized > > > Key: HUDI-3738 > URL: https://issues.apache.org/jira/browse/HUDI-3738 > Project: Apache Hudi > Issue Type: Task > Components: performance >Reporter: sivabalan narayanan >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin closed HUDI-3891.
-
Resolution: Fixed

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy
[ https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3891:
--
Sprint: Hudi-Sprint-Apr-12

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)
[ https://issues.apache.org/jira/browse/HUDI-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexey Kudinkin updated HUDI-3895: -- Epic Link: HUDI-1297 > Make sure Hudi relations do proper file-split packing (on par w/ Spark) > --- > > Key: HUDI-3895 > URL: https://issues.apache.org/jira/browse/HUDI-3895 > Project: Apache Hudi > Issue Type: Bug >Reporter: Alexey Kudinkin >Assignee: Alexey Kudinkin >Priority: Blocker > Fix For: 0.11.0 > > > While investigating on HUDI-3891, it was discovered that upon introduction of > Hudi's own Spark's Relation implementations, file-split packing algorithm was > inadvertently subverted: > Spark algorithm does greedy packing which relies on the list of file-splits > being ordered by the file size (descending in order).
[jira] [Created] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)
Alexey Kudinkin created HUDI-3895: - Summary: Make sure Hudi relations do proper file-split packing (on par w/ Spark) Key: HUDI-3895 URL: https://issues.apache.org/jira/browse/HUDI-3895 Project: Apache Hudi Issue Type: Bug Reporter: Alexey Kudinkin Assignee: Alexey Kudinkin Fix For: 0.11.0 While investigating HUDI-3891, it was discovered that, with the introduction of Hudi's own Spark Relation implementations, the file-split packing algorithm was inadvertently subverted: Spark's algorithm does greedy packing, which relies on the list of file-splits being ordered by file size in descending order.
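The packing problem described above can be sketched in isolation. This is a hedged illustration with a hypothetical name and simplified signature, not Spark's or Hudi's actual code; `max_partition_bytes` and `open_cost_bytes` stand in for Spark's `maxPartitionBytes` and `openCostInBytes` settings:

```python
def get_file_partitions(split_sizes, max_partition_bytes, open_cost_bytes=0):
    """Next-fit greedy packing of file splits into read partitions.

    Like Spark's FilePartition packing, it does NOT sort its input:
    the caller must pass splits largest-first. Feeding it unsorted
    splits is the bug HUDI-3895 describes -- packing still "works",
    but yields more, and more skewed, partitions.
    """
    partitions, current, current_bytes = [], [], 0
    for size in split_sizes:
        # each split also pays a nominal "open cost" so that many tiny
        # files don't collapse into one enormous partition
        cost = size + open_cost_bytes
        if current and current_bytes + cost > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += cost
    if current:
        partitions.append(current)
    return partitions

sizes = [10, 100, 10, 100, 10, 100]
print(len(get_file_partitions(sizes, 100)))                        # 6 partitions: unsorted input
print(len(get_file_partitions(sorted(sizes, reverse=True), 100)))  # 4 partitions: with the sort fix
```

With this toy input, pre-sorting descending (the `.sortBy` fix in PR #5337) drops the partition count from 6 to 4, since the small splits can share a partition instead of each being stranded next to a large one.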
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`
alexeykudinkin commented on code in PR #5337: URL: https://github.com/apache/hudi/pull/5337#discussion_r851667284

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:

@@ -84,21 +84,24 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
   protected def collectFileSplits(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[HoodieBaseFileSplit] = {
     val partitions = listLatestBaseFiles(globPaths, partitionFilters, dataFilters)
-    val fileSplits = partitions.values.toSeq.flatMap { files =>
-      files.flatMap { file =>
-        // TODO move to adapter
-        // TODO fix, currently assuming parquet as underlying format
-        HoodieDataSourceHelper.splitFiles(
-          sparkSession = sparkSession,
-          file = file,
-          // TODO clarify why this is required
-          partitionValues = getPartitionColumnsAsInternalRow(file)
-        )
+    val fileSplits = partitions.values.toSeq
+      .flatMap { files =>
+        files.flatMap { file =>
+          // TODO fix, currently assuming parquet as underlying format
+          HoodieDataSourceHelper.splitFiles(
+            sparkSession = sparkSession,
+            file = file,
+            partitionValues = getPartitionColumnsAsInternalRow(file)
+          )
+        }
       }
-    }
+      // NOTE: It's important to order the splits in the reverse order of their
+      //       size so that we can subsequently bucket them in an efficient manner
+      .sortBy(_.length)(implicitly[Ordering[Long]].reverse)

Review Comment: We want to maintain parity w/ Spark behavior, which simply does greedy packing (which is at most 2x less efficient than optimal). @nsivabalan fair call, I agree that `getFilePartitions` seems like the more appropriate place for this (and if it were placed there in Spark itself we wouldn't have to fix it ourselves), but the reason I'm placing it here is to keep our code (which is essentially a copy of Spark's) in sync (at least for now). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #5338: [WIP][HUDI-3894] Fix datahub and gcp bundles to include HBase dependencies and shading
hudi-bot commented on PR #5338: URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100745630 ## CI report: * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8086) * 5cd21a598d455492412ea525feddaa53325b5c9a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8087) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #5338: [HUDI-3894] Fix datahub and gcp bundles to include HBase dependencies and shading
hudi-bot commented on PR #5338: URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100735491 ## CI report: * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8086) * 5cd21a598d455492412ea525feddaa53325b5c9a UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5338: [HUDI-3894] Fix datahub and gcp bundles to include HBase dependencies and shading
hudi-bot commented on PR #5338: URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100735060 ## CI report: * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8086) * 5cd21a598d455492412ea525feddaa53325b5c9a UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5338: [HUDI-3894] Fix hudi-datahub-sync-bundle to include HBase dependencies and shading
hudi-bot commented on PR #5338: URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100734574 ## CI report: * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8086)
[GitHub] [hudi] hudi-bot commented on pull request #5338: [HUDI-3894] Fix hudi-datahub-sync-bundle to include HBase dependencies and shading
hudi-bot commented on PR #5338: URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100733740 ## CI report: * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a UNKNOWN
[jira] [Updated] (HUDI-3894) Add HBase dependencies and shading in datahub and gcp bundles
[ https://issues.apache.org/jira/browse/HUDI-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-3894: Summary: Add HBase dependencies and shading in datahub and gcp bundles (was: Add HBase dependencies and shading in datahub-sync-bundle) > Add HBase dependencies and shading in datahub and gcp bundles > - > > Key: HUDI-3894 > URL: https://issues.apache.org/jira/browse/HUDI-3894 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0
[jira] [Updated] (HUDI-3894) Add HBase dependencies and shading in datahub-sync-bundle
[ https://issues.apache.org/jira/browse/HUDI-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3894: - Labels: pull-request-available (was: ) > Add HBase dependencies and shading in datahub-sync-bundle > - > > Key: HUDI-3894 > URL: https://issues.apache.org/jira/browse/HUDI-3894 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0
[GitHub] [hudi] yihua opened a new pull request, #5338: [HUDI-3894] Fix hudi-datahub-sync-bundle to include HBase dependencies and shading
yihua opened a new pull request, #5338: URL: https://github.com/apache/hudi/pull/5338 ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] babumahesh-koo commented on issue #5198: [SUPPORT] Querying data genereated by TimestampBasedKeyGenerator failed to parse timestamp in EPOCHMILLISECONDS column to date format
babumahesh-koo commented on issue #5198: URL: https://github.com/apache/hudi/issues/5198#issuecomment-1100731242 @nsivabalan Without the timestamp-based key generator, it works. The observation is that, as long as the extracted values' data types match the original column data types, it works. I have tried the same thing with SQL-based transformers; my use case is solved with transformers.
[jira] [Created] (HUDI-3894) Add HBase dependencies and shading in datahub-sync-bundle
Ethan Guo created HUDI-3894: --- Summary: Add HBase dependencies and shading in datahub-sync-bundle Key: HUDI-3894 URL: https://issues.apache.org/jira/browse/HUDI-3894 Project: Apache Hudi Issue Type: Bug Reporter: Ethan Guo Fix For: 0.11.0
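The fix tracked above adds HBase dependencies and relocations to the bundle poms. As a hedged illustration only — the exact artifact lists and relocation prefixes are assumptions mirroring the shading style of Hudi's other bundles, not the actual PR #5338 change — a maven-shade-plugin relocation for HBase classes looks roughly like this:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- assumed prefix, following the org.apache.hudi.* shading
           convention used elsewhere in Hudi's bundles -->
      <relocation>
        <pattern>org.apache.hadoop.hbase.</pattern>
        <shadedPattern>org.apache.hudi.org.apache.hadoop.hbase.</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

Relocating the HBase packages inside the bundle jar prevents classpath clashes when the datahub/gcp sync bundles run alongside an environment that ships its own (possibly different) HBase version.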
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`
nsivabalan commented on code in PR #5337: URL: https://github.com/apache/hudi/pull/5337#discussion_r851655246 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala: (quoting the same `collectFileSplits` hunk as above) Review Comment: I see that within the Spark adapter we use the next-fit bin-packing algo. If that's never going to change, we can leave it here; if not, we can move the sorting within adapter.getFilePartitions(...)
[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader
kasured commented on issue #5298: URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100717281 @nsivabalan Sure, let me provide more details. There is a StreamingQuery entity which is started by Spark to consume the stream. This is basically what we use, as described here: https://hudi.apache.org/docs/compaction#spark-structured-streaming What we do is create multiple StreamingQuery streams and start them. Each of them consumes from a single Kafka topic and writes to a single Hudi table. So it is `3 different streaming pipeline writing to 3 diff hudi table but using same spark session`, with the only exception that we use 3 different SparkSession objects. Each of them reuses a single SparkContext, which is okay as there should be only one Spark context per JVM. As to 4753, I have already specified it in the section **Possibly Related Issues** (HUDI-3370). However, from what I checked, it is related to the metadata service, which we do not use ("hoodie.metadata.enable" = "false"). Could it also be relevant even if we do not use the metadata table? I am asking because we are using 0.9.0 from Amazon and will need to replace it with a build carrying the patch.
[GitHub] [hudi] yihua commented on a diff in pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`
yihua commented on code in PR #5337: URL: https://github.com/apache/hudi/pull/5337#discussion_r851649669 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala: (quoting the same `collectFileSplits` hunk as above) Review Comment: I'm wondering, instead of sorting the file splits here, whether we can customize the subsequent bucketing algorithm in the adapter. Also, should the TODOs be kept, since they are not resolved?
[GitHub] [hudi] wxplovecc commented on issue #5330: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.
wxplovecc commented on issue #5330: URL: https://github.com/apache/hudi/issues/5330#issuecomment-1100693250 see https://github.com/apache/hudi/pull/5185
[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader
nsivabalan commented on issue #5298: URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100687324 btw, we did fix an issue wrt how Spark's lazy initialization and caching of results could result in wrong files in commit metadata: https://github.com/apache/hudi/pull/4753. It looks like an exact match for what you are reporting. Can you try applying the patch and let us know if you still see the issue?
[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader
nsivabalan commented on issue #5298: URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100686361 yes, I really appreciate your digging in deeper. Let me try to understand the concurrency here. What do you mean by multiple concurrent streaming writes? Are there 3 streams reading from diff upstream sources and writing to 1 hudi table? Or one streaming pipeline which writes to 3 different hudi tables? Or 3 different streaming pipelines writing to 3 diff hudi tables but using the same spark session?
[GitHub] [hudi] nsivabalan commented on issue #5253: Hudi execution plan not generated properly [SUPPORT]
nsivabalan commented on issue #5253: URL: https://github.com/apache/hudi/issues/5253#issuecomment-1100684177 @YannByron @XuQianJin-Stars : can either of you folks please chime in here when you get a chance.
[jira] [Commented] (HUDI-3893) Add support to refresh hoodie.properties at regular intervals
[ https://issues.apache.org/jira/browse/HUDI-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523120#comment-17523120 ] sivabalan narayanan commented on HUDI-3893: --- Suggested to the user: do you think you can add a lambda or something for the following {{touch ${HUDI_TABLE_PATH}/.hoodie/hoodie.properties}} and that should solve the problem, right? > Add support to refresh hoodie.properties at regular intervals > - > > Key: HUDI-3893 > URL: https://issues.apache.org/jira/browse/HUDI-3893 > Project: Apache Hudi > Issue Type: Task >Reporter: sivabalan narayanan >Priority: Critical > Fix For: 0.12.0 > > > in cloud stores, users could set up lifecycle policy to delete files which > are not touched for say 30 days. So, wrt "hoodie.properties" which is created > once and never updated for the most part, it could get caught with the > lifecycle policy. We can ask users not to set the lifecycle policy, but would > be good to add support to hoodie to make it resilient.
[GitHub] [hudi] nsivabalan commented on issue #5281: [SUPPORT] .hoodie/hoodie.properties file can be deleted due to retention settings of cloud providers
nsivabalan commented on issue #5281: URL: https://github.com/apache/hudi/issues/5281#issuecomment-1100683830 do you think you can add a lambda or something for following ``` touch ${HUDI_TABLE_PATH}/.hoodie/hoodie.properties ``` and that should solve the problem right?
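The `touch` suggestion above can be sketched as a small periodic refresher. This is a hedged, local-filesystem illustration with a hypothetical function name, not a Hudi API; a real deployment would run the equivalent from a scheduled Lambda or cron job, and on object stores like S3/GCS there is no mtime to update, so "touching" means rewriting or copying the object in place:

```python
import os
import time

def refresh_hoodie_properties(table_path):
    """'Touch' .hoodie/hoodie.properties so that a last-modified-based
    lifecycle rule (e.g. "delete objects untouched for 30 days") never
    sees the file as stale."""
    props = os.path.join(table_path, ".hoodie", "hoodie.properties")
    now = time.time()
    os.utime(props, (now, now))  # update (atime, mtime), like `touch`
    return props
```

Running this on a schedule shorter than the lifecycle policy's age threshold keeps the file out of the deletion window until Hudi itself makes the file resilient (HUDI-3893).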
[jira] [Updated] (HUDI-3893) Add support to refresh hoodie.properties at regular intervals
[ https://issues.apache.org/jira/browse/HUDI-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-3893: -- Fix Version/s: 0.12.0 > Add support to refresh hoodie.properties at regular intervals > - > > Key: HUDI-3893 > URL: https://issues.apache.org/jira/browse/HUDI-3893 > Project: Apache Hudi > Issue Type: Task >Reporter: sivabalan narayanan >Priority: Critical > Fix For: 0.12.0
[jira] [Created] (HUDI-3893) Add support to refresh hoodie.properties at regular intervals
sivabalan narayanan created HUDI-3893: - Summary: Add support to refresh hoodie.properties at regular intervals Key: HUDI-3893 URL: https://issues.apache.org/jira/browse/HUDI-3893 Project: Apache Hudi Issue Type: Task Reporter: sivabalan narayanan in cloud stores, users could set up lifecycle policy to delete files which are not touched for say 30 days. So, wrt "hoodie.properties" which is created once and never updated for the most part, it could get caught with the lifecycle policy. We can ask users not to set the lifecycle policy, but would be good to add support to hoodie to make it resilient.
[GitHub] [hudi] nsivabalan commented on issue #5281: [SUPPORT] .hoodie/hoodie.properties file can be deleted due to retention settings of cloud providers
nsivabalan commented on issue #5281: URL: https://github.com/apache/hudi/issues/5281#issuecomment-1100682574 have filed a tracking ticket https://issues.apache.org/jira/browse/HUDI-3893
[jira] [Updated] (HUDI-3835) Add UT for delete in java client
[ https://issues.apache.org/jira/browse/HUDI-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3835: - Fix Version/s: 0.12.0 > Add UT for delete in java client > > > Key: HUDI-3835 > URL: https://issues.apache.org/jira/browse/HUDI-3835 > Project: Apache Hudi > Issue Type: Test > Components: writer-core >Reporter: 董可伦 >Assignee: 董可伦 >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0
[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader
kasured commented on issue #5298: URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100619414 @nsivabalan Thank you for looking into that. I have updated the configuration in the description, as it was a little out of date. Since the creation of the ticket, you can see that I have tried multiple options. 1. At first I had foreachBatch, which was not causing async compaction to happen (please see the linked issues). After the code was rewritten to use just structured streaming constructs, async compaction started to be scheduled and executed. So I have tried both inline enabled with async disabled, and vice versa, and the issue that I describe is reproduced in both cases. 2. Not sure what you mean, as I have clustering disabled explicitly with "hoodie.clustering.inline" = "false". And I have not seen any clustering actions, neither in .hoodie nor in the logs. All in all, please check these two sections: **Main observations so far** and **Tried Options**. They are up to date and summarize all that I have tried so far.