[jira] [Updated] (HUDI-3881) Implement index syntax for spark sql

2022-04-16 Thread Forward Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Forward Xu updated HUDI-3881:
-
Description: 
{code:java}
1.create index
CREATE INDEX [IF NOT EXISTS] index_name
ON TABLE [db_name.]table_name (column_name, ...)
AS bloomfilter/lucene
[WITH DEFERRED REFRESH]
[PROPERTIES ('key'='value')] {code}
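For intuition, the {{bloomfilter}} index type above boils down to a probabilistic membership structure: a reader can skip a data file whenever the filter says a key is definitely absent. A minimal sketch (illustrative only, not Hudi's actual implementation; class and parameter names are made up):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely not present' or 'maybe present'."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big int used as the bit array

    def _positions(self, item: str):
        # Derive k positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False => the key is definitely absent, so the file can be skipped.
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

A query evaluating an equality predicate consults the per-file filter and skips any file whose filter returns False; a false positive only costs an unnecessary file read, never a wrong result.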

  was:
{code:java}
CREATE INDEX [IF NOT EXISTS] index_name
ON TABLE [db_name.]table_name (column_name, ...)
AS bloomfilter/lucene
[WITH DEFERRED REFRESH]
[PROPERTIES ('key'='value')] {code}


> Implement index syntax for spark sql
> 
>
> Key: HUDI-3881
> URL: https://issues.apache.org/jira/browse/HUDI-3881
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: spark-sql
>Reporter: Forward Xu
>Assignee: Forward Xu
>Priority: Major
>
> {code:java}
> 1.create index
> CREATE INDEX [IF NOT EXISTS] index_name
> ON TABLE [db_name.]table_name (column_name, ...)
> AS bloomfilter/lucene
> [WITH DEFERRED REFRESH]
> [PROPERTIES ('key'='value')] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3892) Add HoodieReadClient with java

2022-04-16 Thread Forward Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Forward Xu updated HUDI-3892:
-
Description: 
We might need a hoodie read client in Java similar to the one we have for 
Spark. 

[Apache Pulsar|https://github.com/apache/pulsar] is integrating with Hudi, 
taking Hudi as tiered storage to offload cold topic data into Hudi. When 
consumers fetch cold data from a topic, the Pulsar broker will determine whether 
the target data is stored in Pulsar or not. If it is stored in tiered storage 
(Hudi), the Pulsar broker will fetch the data from Hudi via a Java API, package 
it into Pulsar format, and dispatch it to the consumer side.

However, we found that the current Hudi implementation doesn't support reading 
Hudi table records via a Java API, and we couldn't read the target data out of 
Hudi into the Pulsar broker, which will block the Pulsar & Hudi integration.
h3. What we need
 # We need Hudi to support reading records via a Java API
 # We need Hudi to support reading records out while keeping the writer order, 
or support ordering by specific fields.
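The read path described above is essentially a routing decision in the broker. A toy sketch of that flow (hypothetical function and parameter names; the real integration would call a Hudi Java read client, which is exactly what this issue asks for):

```python
from typing import Callable, Dict, List, Optional

def read_entries(
    entry_ids: List[int],
    local_store: Dict[int, bytes],
    tiered_reader: Callable[[int], Optional[bytes]],
) -> List[bytes]:
    """For each entry, serve from the broker's local storage when present;
    otherwise fall back to the tiered store (cold data offloaded to Hudi)."""
    result = []
    for entry_id in entry_ids:
        payload = local_store.get(entry_id)
        if payload is None:
            payload = tiered_reader(entry_id)  # stand-in for a Hudi Java read client
        if payload is None:
            raise KeyError(f"entry {entry_id} not found in local or tiered storage")
        result.append(payload)
    return result
```

Requirement #2 above matters because the payloads returned by the tiered reader must come back in writer order for the broker to re-package them as a consistent Pulsar stream.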

  was:
We might need a hoodie read client in java similar to the one we have for 
spark. 

 

 

[Apache Pulsar|https://github.com/apache/pulsar] is doing integration with 
Hudi, and take Hudi as tiered storage to offload topic cold data into Hudi. 
When consumers fetch cold data from topic, Pulsar broker will locate the target 
data is stored in Pulsar or not. If the target data stored in tiered storage 
(Hudi), Pulsar broker will fetch data from Hudi by Java API, and package them 
into Pulsar format and dispatch to consumer side.

However, we found current Hudi implementation doesn't support read Hudi table 
records by Java API, and we couldn't read the target data out from Hudi into 
Pulsar Broker, which will block the Pulsar & Hudi integration.
h3. What we need
 # We need Hudi to support reading records by Java API
 # We need Hudi to support read records out which keep the writer order, or 
support order by specific fields.


> Add HoodieReadClient with java
> --
>
> Key: HUDI-3892
> URL: https://issues.apache.org/jira/browse/HUDI-3892
> Project: Apache Hudi
>  Issue Type: Task
>  Components: reader-core
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.0
>
>
> We might need a hoodie read client in Java similar to the one we have for 
> Spark. 
> [Apache Pulsar|https://github.com/apache/pulsar] is integrating with Hudi, 
> taking Hudi as tiered storage to offload cold topic data into Hudi. When 
> consumers fetch cold data from a topic, the Pulsar broker will determine 
> whether the target data is stored in Pulsar or not. If it is stored in tiered 
> storage (Hudi), the Pulsar broker will fetch the data from Hudi via a Java 
> API, package it into Pulsar format, and dispatch it to the consumer side.
> However, we found that the current Hudi implementation doesn't support reading 
> Hudi table records via a Java API, and we couldn't read the target data out of 
> Hudi into the Pulsar broker, which will block the Pulsar & Hudi integration.
> h3. What we need
>  # We need Hudi to support reading records via a Java API
>  # We need Hudi to support reading records out while keeping the writer 
> order, or support ordering by specific fields.





[jira] [Assigned] (HUDI-3877) Support Java reader for hudi

2022-04-16 Thread Forward Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Forward Xu reassigned HUDI-3877:


Assignee: (was: Forward Xu)

> Support Java reader for hudi
> 
>
> Key: HUDI-3877
> URL: https://issues.apache.org/jira/browse/HUDI-3877
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Simon Su
>Priority: Major
>
> From issue: 
> [https://github.com/apache/hudi/issues/5313]
>  





[jira] [Assigned] (HUDI-3877) Support Java reader for hudi

2022-04-16 Thread Forward Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Forward Xu reassigned HUDI-3877:


Assignee: Forward Xu

> Support Java reader for hudi
> 
>
> Key: HUDI-3877
> URL: https://issues.apache.org/jira/browse/HUDI-3877
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Simon Su
>Assignee: Forward Xu
>Priority: Major
>
> From issue: 
> [https://github.com/apache/hudi/issues/5313]
>  





[jira] [Updated] (HUDI-3897) Drop scala 2.11 artifacts

2022-04-16 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3897:
-
Description: 
To reduce complexity in the artifacts:

Use Scala 2.12 for all Spark bundles
Use the Scala-free Flink version
- https://flink.apache.org/2022/02/22/scala-free.html

This also lets us get rid of the enforcer plugin
- https://github.com/apache/hudi/pull/5297#discussion_r848922414

  was:
To reduce complexity in the artifacts. 

Use scala 12 for all spark bundles
Use scala-free flink version
- https://flink.apache.org/2022/02/22/scala-free.html
Remove enforcer plugin
- https://github.com/apache/hudi/pull/5297#discussion_r848922414


> Drop scala 2.11 artifacts
> -
>
> Key: HUDI-3897
> URL: https://issues.apache.org/jira/browse/HUDI-3897
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: Raymond Xu
>Priority: Major
>
> To reduce complexity in the artifacts:
> Use Scala 2.12 for all Spark bundles
> Use the Scala-free Flink version
> - https://flink.apache.org/2022/02/22/scala-free.html
> This also lets us get rid of the enforcer plugin
> - https://github.com/apache/hudi/pull/5297#discussion_r848922414





[jira] [Updated] (HUDI-3897) Drop scala 2.11 artifacts

2022-04-16 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3897:
-
Description: 
To reduce complexity in the artifacts:

Use Scala 2.12 for all Spark bundles
Use the Scala-free Flink version
- https://flink.apache.org/2022/02/22/scala-free.html
Remove the enforcer plugin
- https://github.com/apache/hudi/pull/5297#discussion_r848922414

  was:To reduce complexity in the artifacts. 


> Drop scala 2.11 artifacts
> -
>
> Key: HUDI-3897
> URL: https://issues.apache.org/jira/browse/HUDI-3897
> Project: Apache Hudi
>  Issue Type: Task
>  Components: dependencies
>Reporter: Raymond Xu
>Priority: Major
>
> To reduce complexity in the artifacts:
> Use Scala 2.12 for all Spark bundles
> Use the Scala-free Flink version
> - https://flink.apache.org/2022/02/22/scala-free.html
> Remove the enforcer plugin
> - https://github.com/apache/hudi/pull/5297#discussion_r848922414





[jira] [Created] (HUDI-3897) Drop scala 2.11 artifacts

2022-04-16 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-3897:


 Summary: Drop scala 2.11 artifacts
 Key: HUDI-3897
 URL: https://issues.apache.org/jira/browse/HUDI-3897
 Project: Apache Hudi
  Issue Type: Task
  Components: dependencies
Reporter: Raymond Xu


To reduce complexity in the artifacts. 





[GitHub] [hudi] simonsssu commented on issue #5313: [SUPPORT] Do we have plan to support java reader for Hudi?

2022-04-16 Thread GitBox


simonsssu commented on issue #5313:
URL: https://github.com/apache/hudi/issues/5313#issuecomment-1100803165

   @nsivabalan hi nsivabalan, my jira id is HUDI-3877, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5338: [WIP][HUDI-3894] Fix datahub and gcp bundles to include HBase dependencies and shading

2022-04-16 Thread GitBox


hudi-bot commented on PR #5338:
URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100757479

   
   ## CI report:
   
   * 5cd21a598d455492412ea525feddaa53325b5c9a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8087)
 
   
   
   Bot commands: @hudi-bot supports the following commands:
   - `@hudi-bot run azure` re-run the last Azure build
   





[jira] [Commented] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523216#comment-17523216
 ] 

Alexey Kudinkin commented on HUDI-3891:
---

So the root cause of this discrepancy in the amount of data read comes down to 
2 issues:

HUDI-3895

Missing sorting of the file-splits, resulting in invalid bin-packing of them.

HUDI-3896

Inability to apply the `SchemaPruning` optimization rule, since it relies on 
`HadoopFsRelation` being used. As a result, Spark, when reading the table as raw 
Parquet, is able to effectively prune all but the single field requested from 
the nested struct:

!image-2022-04-16-13-50-43-916.png|width=1446,height=155!!image-2022-04-16-13-50-43-956.png|width=1823,height=165!
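For intuition on what schema pruning buys: given a requested column path into a nested struct, only the matching leaf fields need to be read. A toy sketch of that pruning over a schema represented as nested dicts (illustrative only; Spark actually operates on `StructType` and Catalyst expressions):

```python
from typing import Dict, List

def prune_schema(schema: Dict, requested_paths: List[str]) -> Dict:
    """Keep only the fields of a (possibly nested) struct schema that appear
    in the requested column paths, e.g. ['address.city'] out of a wide struct."""
    pruned: Dict = {}
    for path in requested_paths:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue
        child = schema[head]
        if rest and isinstance(child, dict):
            # Recurse into the nested struct, merging with any fields
            # already kept under the same parent.
            sub = prune_schema(child, [rest])
            pruned[head] = {**pruned.get(head, {}), **sub}
        else:
            pruned[head] = child
    return pruned
```

With a columnar format like Parquet, the pruned schema translates directly into fewer column chunks read from disk, which is the saving the screenshots above show.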
 

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: image-2022-04-16-13-50-43-916.png, 
> image-2022-04-16-13-50-43-956.png
>
>
> While benchmarking querying raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those 2 diverge in the # of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png|width=1691,height=149!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png|width=1673,height=142!





[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3891:
--
Description: 
While benchmarking querying raw Parquet tables against Hudi tables, I've run 
the test against the same (Hudi) table:
 # In one query path I'm reading it as just a raw Parquet table
 # In another, I'm reading it as a Hudi RO (read_optimized) table

Surprisingly enough, those 2 diverge in the # of files being read:

 
_Raw Parquet_
!https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png|width=1691,height=149!
 
_Hudi_
!https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png|width=1673,height=142!

  was:
While benchmarking querying raw Parquet tables against Hudi tables, i've run 
the test against the same (Hudi) table:
 # In one query path i'm reading it as just a raw Parquet table
 # In another, i'm reading it as Hudi RO (read_optimized) table


Surprisingly enough, those 2 diverge in the # of files being read:

 
_Raw Parquet_
!https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
 
_Hudi_
!https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!


> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: image-2022-04-16-13-50-43-916.png, 
> image-2022-04-16-13-50-43-956.png
>
>
> While benchmarking querying raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those 2 diverge in the # of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png|width=1691,height=149!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png|width=1673,height=142!





[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3891:
--
Attachment: image-2022-04-16-13-50-43-916.png

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: image-2022-04-16-13-50-43-916.png, 
> image-2022-04-16-13-50-43-956.png
>
>
> While benchmarking querying raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those 2 diverge in the # of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png|width=1691,height=149!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png|width=1673,height=142!





[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3891:
--
Attachment: image-2022-04-16-13-50-43-956.png

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: image-2022-04-16-13-50-43-916.png, 
> image-2022-04-16-13-50-43-956.png
>
>
> While benchmarking querying raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those 2 diverge in the # of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png|width=1691,height=149!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png|width=1673,height=142!





[jira] [Updated] (HUDI-3896) Support Spark optimizations for `HadoopFsRelation`

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3896:
--
Description: 
After migrating to Hudi's own Relation impls, we unfortunately broke some of 
the optimizations that Spark applies exclusively for `HadoopFsRelation`.

While these optimizations could perfectly well be implemented for any 
`FileRelation`, Spark unfortunately predicates them on the usage of 
`HadoopFsRelation`, therefore making them non-applicable to any of Hudi's 
relations.

Proper long-term solutions would fix this in Spark and could be either of:
 # Generalizing such optimizations to any `FileRelation`
 # Making `HadoopFsRelation` extensible (making it a non-case class)

One example of this is Spark's `SchemaPruning` optimization rule (HUDI-3891): 
Spark 3.2.x is able to effectively reduce the amount of data read via schema 
pruning (projecting the read data) even for nested structs; however, this 
optimization is predicated on the usage of `HadoopFsRelation`:

!Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!

  was:
After migrating to Hudi's own Relation impls, we unfortunately broke off some 
of the optimizations that Spark apply exclusively for `HadoopFsRelation`.

 

While these optimizations could be perfectly implemented for any 
`FileRelation`, Spark is unfortunately predicating them on usage of 
HadoopFsRelation, therefore making them non-applicable to any of the Hudi's 
relations.

Proper longterm solutions would be fixing this in Spark and could be either of:
 # Generalizing such optimizations to any `FileRelation`
 # Making `HadoopFsRelation` extensible (making it non-case class)


> Support Spark optimizations for `HadoopFsRelation`
> --
>
> Key: HUDI-3896
> URL: https://issues.apache.org/jira/browse/HUDI-3896
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-04-16 at 1.46.50 PM.png
>
>
> After migrating to Hudi's own Relation impls, we unfortunately broke some of 
> the optimizations that Spark applies exclusively for `HadoopFsRelation`.
>  
> While these optimizations could perfectly well be implemented for any 
> `FileRelation`, Spark unfortunately predicates them on the usage of 
> `HadoopFsRelation`, therefore making them non-applicable to any of Hudi's 
> relations.
> Proper long-term solutions would fix this in Spark and could be either of:
>  # Generalizing such optimizations to any `FileRelation`
>  # Making `HadoopFsRelation` extensible (making it a non-case class)
>  
> One example of this is Spark's `SchemaPruning` optimization rule 
> (HUDI-3891): Spark 3.2.x is able to effectively reduce the amount of data read 
> via schema pruning (projecting the read data) even for nested structs; 
> however, this optimization is predicated on the usage of `HadoopFsRelation`:
> !Screen Shot 2022-04-16 at 1.46.50 PM.png|width=739,height=143!





[jira] [Updated] (HUDI-3896) Support Spark optimizations for `HadoopFsRelation`

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3896:
--
Attachment: Screen Shot 2022-04-16 at 1.46.50 PM.png

> Support Spark optimizations for `HadoopFsRelation`
> --
>
> Key: HUDI-3896
> URL: https://issues.apache.org/jira/browse/HUDI-3896
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-04-16 at 1.46.50 PM.png
>
>
> After migrating to Hudi's own Relation impls, we unfortunately broke some of 
> the optimizations that Spark applies exclusively for `HadoopFsRelation`.
>  
> While these optimizations could perfectly well be implemented for any 
> `FileRelation`, Spark unfortunately predicates them on the usage of 
> `HadoopFsRelation`, therefore making them non-applicable to any of Hudi's 
> relations.
> Proper long-term solutions would fix this in Spark and could be either of:
>  # Generalizing such optimizations to any `FileRelation`
>  # Making `HadoopFsRelation` extensible (making it a non-case class)





[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3891:
--
Issue Type: Task  (was: Bug)

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> While benchmarking querying raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those 2 diverge in the # of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!





[jira] [Created] (HUDI-3896) Support Spark optimizations for `HadoopFsRelation`

2022-04-16 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-3896:
-

 Summary: Support Spark optimizations for `HadoopFsRelation`
 Key: HUDI-3896
 URL: https://issues.apache.org/jira/browse/HUDI-3896
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.12.0


After migrating to Hudi's own Relation impls, we unfortunately broke some of 
the optimizations that Spark applies exclusively for `HadoopFsRelation`.

While these optimizations could perfectly well be implemented for any 
`FileRelation`, Spark unfortunately predicates them on the usage of 
`HadoopFsRelation`, therefore making them non-applicable to any of Hudi's 
relations.

Proper long-term solutions would fix this in Spark and could be either of:
 # Generalizing such optimizations to any `FileRelation`
 # Making `HadoopFsRelation` extensible (making it a non-case class)





[jira] [Updated] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3895:
--
Status: Patch Available  (was: In Progress)

> Make sure Hudi relations do proper file-split packing (on par w/ Spark)
> ---
>
> Key: HUDI-3895
> URL: https://issues.apache.org/jira/browse/HUDI-3895
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> While investigating HUDI-3891, it was discovered that with the introduction 
> of Hudi's own Spark Relation implementations, the file-split packing algorithm 
> was inadvertently subverted: 
> Spark's algorithm does greedy packing, which relies on the list of file-splits 
> being ordered by file size (descending).
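The greedy packing referred to above can be sketched as follows (a simplified illustration of Spark-style split coalescing, not Spark's or Hudi's actual code; the function and parameter names are made up). The descending sort on the first line of the loop is the step this issue is about:

```python
from typing import List

def pack_splits(split_sizes: List[int], max_bin_bytes: int) -> List[List[int]]:
    """Greedily coalesce file splits into read tasks, Spark-style:
    order splits by size descending, then fill each bin until the size
    cap would be exceeded. HUDI-3895 reports that the descending sort
    was missing, invalidating the packing."""
    bins: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(split_sizes, reverse=True):
        # Close the current bin when the next split would overflow it.
        if current and current_size + size > max_bin_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins
```

For example, splits of 90, 60, 40, and 10 bytes with a 100-byte cap pack into three tasks: [90], [60, 40], and [10].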





[jira] [Updated] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3895:
--
Status: In Progress  (was: Open)

> Make sure Hudi relations do proper file-split packing (on par w/ Spark)
> ---
>
> Key: HUDI-3895
> URL: https://issues.apache.org/jira/browse/HUDI-3895
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> While investigating HUDI-3891, it was discovered that with the introduction 
> of Hudi's own Spark Relation implementations, the file-split packing algorithm 
> was inadvertently subverted: 
> Spark's algorithm does greedy packing, which relies on the list of file-splits 
> being ordered by file size (descending).





[jira] [Updated] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3895:
--
Sprint: Hudi-Sprint-Apr-12

> Make sure Hudi relations do proper file-split packing (on par w/ Spark)
> ---
>
> Key: HUDI-3895
> URL: https://issues.apache.org/jira/browse/HUDI-3895
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> While investigating HUDI-3891, it was discovered that with the introduction 
> of Hudi's own Spark Relation implementations, the file-split packing algorithm 
> was inadvertently subverted: 
> Spark's algorithm does greedy packing, which relies on the list of file-splits 
> being ordered by file size (descending).





[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3891:
--
Priority: Blocker  (was: Critical)

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> While benchmarking querying raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those 2 diverge in the # of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!





[jira] [Closed] (HUDI-3738) Perf comparison between parquet and hudi for COW snapshot and MOR read optimized

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-3738.
-
Resolution: Fixed

> Perf comparison between parquet and hudi for COW snapshot and MOR read 
> optimized
> 
>
> Key: HUDI-3738
> URL: https://issues.apache.org/jira/browse/HUDI-3738
> Project: Apache Hudi
>  Issue Type: Task
>  Components: performance
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>






[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3891:
--
Fix Version/s: 0.11.0

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> While benchmarking querying raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those 2 diverge in the # of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!





[jira] [Resolved] (HUDI-3738) Perf comparison between parquet and hudi for COW snapshot and MOR read optimized

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin resolved HUDI-3738.
---

> Perf comparison between parquet and hudi for COW snapshot and MOR read 
> optimized
> 
>
> Key: HUDI-3738
> URL: https://issues.apache.org/jira/browse/HUDI-3738
> Project: Apache Hudi
>  Issue Type: Task
>  Components: performance
>Reporter: sivabalan narayanan
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-3891.
-
Resolution: Fixed

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> While benchmarking queries on raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those two diverge in the number of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3891) Investigate Hudi vs Raw Parquet table discrepancy

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3891:
--
Sprint: Hudi-Sprint-Apr-12

> Investigate Hudi vs Raw Parquet table discrepancy
> -
>
> Key: HUDI-3891
> URL: https://issues.apache.org/jira/browse/HUDI-3891
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> While benchmarking queries on raw Parquet tables against Hudi tables, I've run 
> the test against the same (Hudi) table:
>  # In one query path I'm reading it as just a raw Parquet table
>  # In another, I'm reading it as a Hudi RO (read_optimized) table
> Surprisingly enough, those two diverge in the number of files being read:
>  
> _Raw Parquet_
> !https://t18029943.p.clickup-attachments.com/t18029943/f700a129-35bc-4aaa-948c-9495392653f2/Screen%20Shot%202022-04-15%20at%205.20.41%20PM.png!
>  
> _Hudi_
> !https://t18029943.p.clickup-attachments.com/t18029943/d063c689-a254-45cf-8ba5-07fc88b354b6/Screen%20Shot%202022-04-15%20at%205.21.33%20PM.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)

2022-04-16 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3895:
--
Epic Link: HUDI-1297

> Make sure Hudi relations do proper file-split packing (on par w/ Spark)
> ---
>
> Key: HUDI-3895
> URL: https://issues.apache.org/jira/browse/HUDI-3895
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.11.0
>
>
> While investigating HUDI-3891, it was discovered that with the introduction of 
> Hudi's own Spark Relation implementations, the file-split packing algorithm was 
> inadvertently subverted: 
> Spark's algorithm does greedy packing, which relies on the list of file-splits 
> being ordered by file size in descending order.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3895) Make sure Hudi relations do proper file-split packing (on par w/ Spark)

2022-04-16 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-3895:
-

 Summary: Make sure Hudi relations do proper file-split packing (on 
par w/ Spark)
 Key: HUDI-3895
 URL: https://issues.apache.org/jira/browse/HUDI-3895
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.11.0


While investigating HUDI-3891, it was discovered that with the introduction of 
Hudi's own Spark Relation implementations, the file-split packing algorithm was 
inadvertently subverted: 

Spark's algorithm does greedy packing, which relies on the list of file-splits 
being ordered by file size in descending order.
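Why the descending order matters can be seen from a rough sketch of this kind of greedy (next-fit) packing. This is an illustration only, in Python rather than Hudi's actual Scala: the function name and parameters (`pack_splits`, `max_partition_bytes`, `open_cost`) are made up for the example and are not Spark's or Hudi's real API.

```python
def pack_splits(splits, max_partition_bytes, open_cost=0):
    # Greedy next-fit packing: walk the (name, size) splits in
    # descending size order and close the current partition whenever
    # the next split would overflow it. `open_cost` models a per-file
    # open overhead, analogous to Spark's openCostInBytes.
    partitions, current, current_bytes = [], [], 0
    for name, size in sorted(splits, key=lambda s: s[1], reverse=True):
        if current and current_bytes + size > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size + open_cost
    if current:
        partitions.append(current)
    return partitions

# Five splits packed into partitions of at most 100 bytes
buckets = pack_splits(
    [("a", 60), ("b", 50), ("c", 40), ("d", 30), ("e", 10)],
    max_partition_bytes=100)
```

If the splits arrive unsorted, the same greedy pass can close partitions early on small files and produce badly skewed buckets, which is exactly the regression described above.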



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`

2022-04-16 Thread GitBox


alexeykudinkin commented on code in PR #5337:
URL: https://github.com/apache/hudi/pull/5337#discussion_r851667284


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -84,21 +84,24 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
 
   protected def collectFileSplits(partitionFilters: Seq[Expression], 
dataFilters: Seq[Expression]): Seq[HoodieBaseFileSplit] = {
 val partitions = listLatestBaseFiles(globPaths, partitionFilters, 
dataFilters)
-val fileSplits = partitions.values.toSeq.flatMap { files =>
-  files.flatMap { file =>
-// TODO move to adapter
-// TODO fix, currently assuming parquet as underlying format
-HoodieDataSourceHelper.splitFiles(
-  sparkSession = sparkSession,
-  file = file,
-  // TODO clarify why this is required
-  partitionValues = getPartitionColumnsAsInternalRow(file)
-)
+val fileSplits = partitions.values.toSeq
+  .flatMap { files =>
+files.flatMap { file =>
+  // TODO fix, currently assuming parquet as underlying format
+  HoodieDataSourceHelper.splitFiles(
+sparkSession = sparkSession,
+file = file,
+partitionValues = getPartitionColumnsAsInternalRow(file)
+  )
+}
   }
-}
+  // NOTE: It's important to order the splits in the reverse order of their
+  //   size so that we can subsequently bucket them in an efficient manner
+  .sortBy(_.length)(implicitly[Ordering[Long]].reverse)

Review Comment:
   We want to maintain parity w/ Spark's behavior, which simply does greedy 
packing (at most 2x less efficient than optimal).
   
   @nsivabalan fair call, I agree that `getFilePartitions` seems like a more 
appropriate place for this (and if it were placed there in Spark itself we 
wouldn't have to fix it ourselves), but the reason I'm placing it here is to 
keep our code (which is essentially a copy of Spark's) in sync (at least for now).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5338: [WIP][HUDI-3894] Fix datahub and gcp bundles to include HBase dependencies and shading

2022-04-16 Thread GitBox


hudi-bot commented on PR #5338:
URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100745630

   
   ## CI report:
   
   * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8086)
 
   * 5cd21a598d455492412ea525feddaa53325b5c9a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8087)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5338: [HUDI-3894] Fix datahub and gcp bundles to include HBase dependencies and shading

2022-04-16 Thread GitBox


hudi-bot commented on PR #5338:
URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100735491

   
   ## CI report:
   
   * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8086)
 
   * 5cd21a598d455492412ea525feddaa53325b5c9a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5338: [HUDI-3894] Fix datahub and gcp bundles to include HBase dependencies and shading

2022-04-16 Thread GitBox


hudi-bot commented on PR #5338:
URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100735060

   
   ## CI report:
   
   * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8086)
 
   * 5cd21a598d455492412ea525feddaa53325b5c9a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5338: [HUDI-3894] Fix hudi-datahub-sync-bundle to include HBase dependencies and shading

2022-04-16 Thread GitBox


hudi-bot commented on PR #5338:
URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100734574

   
   ## CI report:
   
   * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8086)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5338: [HUDI-3894] Fix hudi-datahub-sync-bundle to include HBase dependencies and shading

2022-04-16 Thread GitBox


hudi-bot commented on PR #5338:
URL: https://github.com/apache/hudi/pull/5338#issuecomment-1100733740

   
   ## CI report:
   
   * d619e9656c39f88c5e5c07f4d2b01baaf3e8c64a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3894) Add HBase dependencies and shading in datahub and gcp bundles

2022-04-16 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3894:

Summary: Add HBase dependencies and shading in datahub and gcp bundles  
(was: Add HBase dependencies and shading in datahub-sync-bundle)

> Add HBase dependencies and shading in datahub and gcp bundles
> -
>
> Key: HUDI-3894
> URL: https://issues.apache.org/jira/browse/HUDI-3894
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3894) Add HBase dependencies and shading in datahub-sync-bundle

2022-04-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-3894:
-
Labels: pull-request-available  (was: )

> Add HBase dependencies and shading in datahub-sync-bundle
> -
>
> Key: HUDI-3894
> URL: https://issues.apache.org/jira/browse/HUDI-3894
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] yihua opened a new pull request, #5338: [HUDI-3894] Fix hudi-datahub-sync-bundle to include HBase dependencies and shading

2022-04-16 Thread GitBox


yihua opened a new pull request, #5338:
URL: https://github.com/apache/hudi/pull/5338

   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] babumahesh-koo commented on issue #5198: [SUPPORT] Querying data genereated by TimestampBasedKeyGenerator failed to parse timestamp in EPOCHMILLISECONDS column to date format

2022-04-16 Thread GitBox


babumahesh-koo commented on issue #5198:
URL: https://github.com/apache/hudi/issues/5198#issuecomment-1100731242

   @nsivabalan Without the timestamp-based key generator, it works. 
   
   The observation is that, as long as the extracted values' data types match 
the original column's data type, it works.
   
   I have tried the same thing with SQL-based transformers; my use case is 
solved with transformers. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-3894) Add HBase dependencies and shading in datahub-sync-bundle

2022-04-16 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-3894:
---

 Summary: Add HBase dependencies and shading in datahub-sync-bundle
 Key: HUDI-3894
 URL: https://issues.apache.org/jira/browse/HUDI-3894
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ethan Guo
 Fix For: 0.11.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] nsivabalan commented on a diff in pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`

2022-04-16 Thread GitBox


nsivabalan commented on code in PR #5337:
URL: https://github.com/apache/hudi/pull/5337#discussion_r851655246


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -84,21 +84,24 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
 
   protected def collectFileSplits(partitionFilters: Seq[Expression], 
dataFilters: Seq[Expression]): Seq[HoodieBaseFileSplit] = {
 val partitions = listLatestBaseFiles(globPaths, partitionFilters, 
dataFilters)
-val fileSplits = partitions.values.toSeq.flatMap { files =>
-  files.flatMap { file =>
-// TODO move to adapter
-// TODO fix, currently assuming parquet as underlying format
-HoodieDataSourceHelper.splitFiles(
-  sparkSession = sparkSession,
-  file = file,
-  // TODO clarify why this is required
-  partitionValues = getPartitionColumnsAsInternalRow(file)
-)
+val fileSplits = partitions.values.toSeq
+  .flatMap { files =>
+files.flatMap { file =>
+  // TODO fix, currently assuming parquet as underlying format
+  HoodieDataSourceHelper.splitFiles(
+sparkSession = sparkSession,
+file = file,
+partitionValues = getPartitionColumnsAsInternalRow(file)
+  )
+}
   }
-}
+  // NOTE: It's important to order the splits in the reverse order of their
+  //   size so that we can subsequently bucket them in an efficient manner
+  .sortBy(_.length)(implicitly[Ordering[Long]].reverse)

Review Comment:
   I see that within the Spark adapter we use the next-fit bin-packing algo. If 
that's never going to change, we can leave it here. If not, we can move the 
sorting within adaptor.getFilePartitions(...) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

2022-04-16 Thread GitBox


kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100717281

   @nsivabalan Sure, let me provide more details. There is a StreamingQuery 
entity which is started by Spark to consume the stream. This is basically what 
we use, as described here: 
https://hudi.apache.org/docs/compaction#spark-structured-streaming
   
   So what we do is create multiple StreamingQuery streams and start them. 
Each of them consumes from a single Kafka topic and writes to a single Hudi 
table. So it is `3 different streaming pipeline writing to 3 diff hudi table 
but using same spark session`, with the only exception that we use 3 different 
SparkSession objects. Each of them reuses a single SparkContext, which is okay, 
as there should be only one Spark context per JVM.
   
   As to 4753, I have already specified it in the section **Possibly Related 
Issues** (HUDI-3370). However, from what I checked, it is related to the 
metadata service, which we do not use ("hoodie.metadata.enable" = "false"). 
Could it also be relevant even if we do not use the metadata table? I am asking 
because we are using 0.9.0 from Amazon and I will need to replace it with the 
one with the patch.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on a diff in pull request #5337: [HUDI-3891] Fixing files partitioning sequence for `BaseFileOnlyRelation`

2022-04-16 Thread GitBox


yihua commented on code in PR #5337:
URL: https://github.com/apache/hudi/pull/5337#discussion_r851649669


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -84,21 +84,24 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
 
   protected def collectFileSplits(partitionFilters: Seq[Expression], 
dataFilters: Seq[Expression]): Seq[HoodieBaseFileSplit] = {
 val partitions = listLatestBaseFiles(globPaths, partitionFilters, 
dataFilters)
-val fileSplits = partitions.values.toSeq.flatMap { files =>
-  files.flatMap { file =>
-// TODO move to adapter
-// TODO fix, currently assuming parquet as underlying format
-HoodieDataSourceHelper.splitFiles(
-  sparkSession = sparkSession,
-  file = file,
-  // TODO clarify why this is required
-  partitionValues = getPartitionColumnsAsInternalRow(file)
-)
+val fileSplits = partitions.values.toSeq
+  .flatMap { files =>
+files.flatMap { file =>
+  // TODO fix, currently assuming parquet as underlying format
+  HoodieDataSourceHelper.splitFiles(
+sparkSession = sparkSession,
+file = file,
+partitionValues = getPartitionColumnsAsInternalRow(file)
+  )
+}
   }
-}
+  // NOTE: It's important to order the splits in the reverse order of their
+  //   size so that we can subsequently bucket them in an efficient manner
+  .sortBy(_.length)(implicitly[Ordering[Long]].reverse)

Review Comment:
   I'm wondering, instead of sorting the file splits here, whether we can 
customize the subsequent bucketing algorithm in the adapter.  Also, should the 
TODOs be kept, since they are not resolved? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wxplovecc commented on issue #5330: [SUPPORT] [BUG] Duplicate fileID ??? from bucket ?? of partition found during the BucketStreamWriteFunction index bootstrap.

2022-04-16 Thread GitBox


wxplovecc commented on issue #5330:
URL: https://github.com/apache/hudi/issues/5330#issuecomment-1100693250

   see https://github.com/apache/hudi/pull/5185


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

2022-04-16 Thread GitBox


nsivabalan commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100687324

   btw, we did fix an issue w.r.t. how Spark's lazy initialization and caching 
of results could result in wrong files in commit metadata: 
https://github.com/apache/hudi/pull/4753. It looks like it exactly matches what 
you are reporting. 
   Can you try applying the patch and let us know if you still see the issue? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

2022-04-16 Thread GitBox


nsivabalan commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100686361

   Yes, I really appreciate your digging in deeper. 
   Let me try to understand the concurrency here. 
   What do you mean by multiple concurrent streaming writes? Are there 3 
streams reading from different upstream sources and writing to 1 Hudi table? Or 
one streaming pipeline which writes to 3 different Hudi tables? Or 3 different 
streaming pipelines writing to 3 different Hudi tables but using the same Spark 
session? 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5253: Hudi execution plan not generated properly [SUPPORT]

2022-04-16 Thread GitBox


nsivabalan commented on issue #5253:
URL: https://github.com/apache/hudi/issues/5253#issuecomment-1100684177

   @YannByron @XuQianJin-Stars : can either of you folks please chime in here 
when you get a chance. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-3893) Add support to refresh hoodie.properties at regular intervals

2022-04-16 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523120#comment-17523120
 ] 

sivabalan narayanan commented on HUDI-3893:
---

Suggested to the user:

 

Do you think you can add a lambda or something for the following:
 {{touch ${HUDI_TABLE_PATH}/.hoodie/hoodie.properties}}
That should solve the problem, right?

 

> Add support to refresh hoodie.properties at regular intervals
> -
>
> Key: HUDI-3893
> URL: https://issues.apache.org/jira/browse/HUDI-3893
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.0
>
>
> In cloud stores, users could set up a lifecycle policy to delete files that 
> have not been touched for, say, 30 days. "hoodie.properties", which is created 
> once and for the most part never updated, could get caught by such a lifecycle 
> policy. We can ask users not to set the lifecycle policy, but it would be good 
> to add support to Hudi to make it resilient. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] nsivabalan commented on issue #5281: [SUPPORT] .hoodie/hoodie.properties file can be deleted due to retention settings of cloud providers

2022-04-16 Thread GitBox


nsivabalan commented on issue #5281:
URL: https://github.com/apache/hudi/issues/5281#issuecomment-1100683830

   do you think you can add a lambda or something for the following 
   ```
   touch ${HUDI_TABLE_PATH}/.hoodie/hoodie.properties
   ```
   and that should solve the problem, right? 
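   A minimal sketch of what such a periodic "touch" could look like, assuming a 
filesystem-style path. The helper name and the stand-in temp file are made up 
for illustration; on an object store like S3 there is no literal `touch`, and 
the usual equivalent is a scheduled self-copy that rewrites the object's 
metadata (verify the exact mechanism for your store).

   ```python
   import os
   import tempfile
   import time

   def refresh_mtime(path):
       # Bump the file's modification time to "now" (like `touch`) so an
       # age-based lifecycle/retention policy does not see it as stale.
       os.utime(path, None)

   # Demo against a temporary stand-in for .hoodie/hoodie.properties
   with tempfile.NamedTemporaryFile(delete=False) as f:
       props_path = f.name
   before = os.stat(props_path).st_mtime
   time.sleep(0.05)
   refresh_mtime(props_path)
   after = os.stat(props_path).st_mtime
   ```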


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3893) Add support to refresh hoodie.properties at regular intervals

2022-04-16 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3893:
--
Fix Version/s: 0.12.0

> Add support to refresh hoodie.properties at regular intervals
> -
>
> Key: HUDI-3893
> URL: https://issues.apache.org/jira/browse/HUDI-3893
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Priority: Critical
> Fix For: 0.12.0
>
>
> In cloud stores, users could set up a lifecycle policy to delete files that 
> have not been touched for, say, 30 days. "hoodie.properties", which is created 
> once and for the most part never updated, could get caught by such a lifecycle 
> policy. We can ask users not to set the lifecycle policy, but it would be good 
> to add support to Hudi to make it resilient. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3893) Add support to refresh hoodie.properties at regular intervals

2022-04-16 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-3893:
-

 Summary: Add support to refresh hoodie.properties at regular 
intervals
 Key: HUDI-3893
 URL: https://issues.apache.org/jira/browse/HUDI-3893
 Project: Apache Hudi
  Issue Type: Task
Reporter: sivabalan narayanan


In cloud stores, users could set up a lifecycle policy to delete files that have 
not been touched for, say, 30 days. "hoodie.properties", which is created once 
and for the most part never updated, could get caught by such a lifecycle 
policy. We can ask users not to set the lifecycle policy, but it would be good 
to add support to Hudi to make it resilient. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] nsivabalan commented on issue #5281: [SUPPORT] .hoodie/hoodie.properties file can be deleted due to retention settings of cloud providers

2022-04-16 Thread GitBox


nsivabalan commented on issue #5281:
URL: https://github.com/apache/hudi/issues/5281#issuecomment-1100682574

   have filed a tracking ticket https://issues.apache.org/jira/browse/HUDI-3893
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3835) Add UT for delete in java client

2022-04-16 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3835:
-
Fix Version/s: 0.12.0

> Add UT for delete in java client
> 
>
> Key: HUDI-3835
> URL: https://issues.apache.org/jira/browse/HUDI-3835
> Project: Apache Hudi
>  Issue Type: Test
>  Components: writer-core
>Reporter: 董可伦
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] kasured commented on issue #5298: [SUPPORT] File is deleted during inline compaction on MOR table causing subsequent FileNotFoundException on a reader

2022-04-16 Thread GitBox


kasured commented on issue #5298:
URL: https://github.com/apache/hudi/issues/5298#issuecomment-1100619414

   @nsivabalan Thank you for looking into that. I have updated the 
configuration in the description, as it was a little out of date. Since the 
creation of the ticket, you can see that I have tried multiple options.
   
   1. In the first iteration I had foreachBatch, which was not causing async 
compaction to happen (please see the linked issues). After the code was 
rewritten to use just Structured Streaming constructs, async compaction started 
to be scheduled and executed. So I have tried both inline enabled with async 
disabled, and vice versa, and the issue that I describe is reproduced in both 
cases.
   2. Not sure what you mean, as I have clustering disabled explicitly with 
"hoodie.clustering.inline" = "false". I have also not seen any clustering 
actions, either in .hoodie or in the logs. 
   
   All in all, please check these two sections: **Main observations so far** and 
**Tried Options**. They are up to date and summarize all that I have tried so 
far.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org