[ 
https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090098#comment-17090098
 ] 

Udit Mehrotra commented on HUDI-829:
------------------------------------

You may also want to look at my implementation of a custom relation in Spark to 
read bootstrapped tables: 
[https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]
 . There I build my own file index using Spark's InMemoryFileIndex, and the 
filtering is now a single operation: once I have all the files, I use the 
Hudi filesystem view just once to get the latest files. It's still a work in 
progress, and I have yet to see how fast this is going to be. We can consider 
moving to a model where our reads in Spark happen through our own relations, 
and underneath we use the native readers.
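
As a rough illustration of the single-pass filtering idea described above (not 
the code from the pull request), the sketch below takes a flat listing of base 
files and keeps the newest committed file per file group. It is plain Python 
and does not call Spark or Hudi. The function name latest_base_files, the 
sample paths, and the instant values are hypothetical, and the sketch assumes 
base files are named <fileId>_<writeToken>_<instantTime>.parquet with 
fixed-width instant times that sort lexicographically.

```python
def latest_base_files(paths, completed_instants):
    """Return, per (partition, file id), the base file with the newest completed instant.

    Assumption: file names look like <fileId>_<writeToken>_<instantTime>.<ext>
    and instant times are fixed-width strings, so string comparison orders them.
    """
    latest = {}
    for path in paths:
        # Split "partition/path/file.parquet" into partition path and file name.
        partition, _, name = path.rpartition("/")
        stem, _ext = name.rsplit(".", 1)
        file_id, _write_token, instant = stem.split("_")
        if instant not in completed_instants:
            continue  # ignore files written by inflight or failed commits
        key = (partition, file_id)
        if key not in latest or instant > latest[key][0]:
            latest[key] = (instant, path)
    return sorted(path for _instant, path in latest.values())


if __name__ == "__main__":
    # Hypothetical listing, standing in for what a file index would return.
    listing = [
        "2020/04/20/f1-0_1-0-1_001.parquet",
        "2020/04/20/f1-0_1-0-1_002.parquet",
        "2020/04/20/f2-0_1-0-1_001.parquet",
        "2020/04/21/f1-0_1-0-1_003.parquet",
    ]
    for p in latest_base_files(listing, {"001", "002"}):
        print(p)
```

The point of the sketch is only that the listing is walked once after all 
files are known, instead of filtering each path separately.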

> Efficiently reading hudi tables through spark-shell
> ---------------------------------------------------
>
>                 Key: HUDI-829
>                 URL: https://issues.apache.org/jira/browse/HUDI-829
>             Project: Apache Hudi (incubating)
>          Issue Type: Task
>          Components: Spark Integration
>            Reporter: Nishith Agarwal
>            Assignee: Nishith Agarwal
>            Priority: Major
>
> [~uditme] Created this ticket to track some discussion on the read/query 
> path of Spark with Hudi tables. 
> My understanding is that when you read Hudi tables through spark-shell, some 
> of your queries are slower due to sequential work performed by Spark when 
> interacting with Hudi tables (even with 
> spark.sql.hive.convertMetastoreParquet, which can give you the same data 
> reading speed and all the vectorization benefits). Is this slowness observed 
> during Spark query planning? Can you please elaborate on this? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
