[ https://issues.apache.org/jira/browse/HUDI-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090098#comment-17090098 ]
Udit Mehrotra commented on HUDI-829:
------------------------------------

You may also want to look at my implementation of a custom relation in Spark for reading bootstrapped tables: [https://github.com/apache/incubator-hudi/pull/1475/files#diff-f14ac7b3cff88313d650b01a56a2b8f8R191]. Here I build my own file index using Spark's InMemoryFileIndex, and the filtering part is now just one operation: once I have all the files, I query the Hudi filesystem view a single time to get the latest files. It's still work in progress, and I have yet to see how fast this is going to be. We can consider moving to a model where our reading in Spark happens through our own relations, and underneath we use the native readers.

> Efficiently reading hudi tables through spark-shell
> ---------------------------------------------------
>
>                 Key: HUDI-829
>                 URL: https://issues.apache.org/jira/browse/HUDI-829
>             Project: Apache Hudi (incubating)
>          Issue Type: Task
>          Components: Spark Integration
>            Reporter: Nishith Agarwal
>            Assignee: Nishith Agarwal
>            Priority: Major
>
> [~uditme] Created this ticket to track some discussion on the read/query path of Spark with Hudi tables.
> My understanding is that when you read Hudi tables through spark-shell, some of your queries are slower due to some sequential activity performed by Spark when interacting with Hudi tables (even with spark.sql.hive.convertMetastoreParquet, which can give you the same data reading speed and all the vectorization benefits). Is this slowness observed during Spark query planning? Can you please elaborate on this?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
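The core idea in Udit's comment, listing all files once and then filtering to the latest file slice per file group in a single operation, can be sketched roughly as follows. This is not Hudi's actual filesystem-view API; the `latest_files` helper, the `(file_group_id, commit_ts, path)` tuple shape, and the sample listing are all hypothetical, chosen only to illustrate the one-pass filtering step:

```python
# Hypothetical sketch (not Hudi's real API): given a single listing of all
# data files, keep only the file with the highest commit timestamp per file
# group, in one pass over the listing.
def latest_files(files):
    """files: iterable of (file_group_id, commit_ts, path) tuples.
    Returns {file_group_id: path} for the latest commit per group."""
    latest = {}
    for group_id, commit_ts, path in files:
        current = latest.get(group_id)
        if current is None or commit_ts > current[0]:
            latest[group_id] = (commit_ts, path)
    return {group: path for group, (_, path) in latest.items()}

# Example listing: two versions of file group fg-1, one of fg-2.
listing = [
    ("fg-1", "20200401", "fg-1_20200401.parquet"),
    ("fg-1", "20200422", "fg-1_20200422.parquet"),
    ("fg-2", "20200410", "fg-2_20200410.parquet"),
]
print(latest_files(listing))
# → {'fg-1': 'fg-1_20200422.parquet', 'fg-2': 'fg-2_20200410.parquet'}
```

In the actual PR, the listing itself would come from Spark's InMemoryFileIndex, so the expensive filesystem scan and the latest-slice filter each happen only once per query.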