[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647126#comment-14647126 ]

Samphel Norden commented on SPARK-9347:
---------------------------------------

No, I haven't tried the latest. Assuming the distributed/parallel footer read is 
what's in the latest: (a) a distributed Spark job would still struggle to read 
the data, since there are tens of thousands of large Parquet files, and (b) 
correct me if I'm wrong, but as per my understanding of HDFS, it's not really 
going to hand back only the last block containing the Parquet footer; instead 
the part file would be transferred in its entirety into memory, which is 
another constraint to deal with.
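
For reference, a distributed footer read along the lines discussed above might look roughly like the sketch below. This is only an illustration, not the SPARK-8838 implementation: the dataset root, the glob pattern, the partition count, and parquet-mr's ParquetFileReader.readFooter being available on the executors' classpath are all assumptions.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import parquet.hadoop.ParquetFileReader

// Hypothetical dataset root; the glob assumes one level of partition dirs.
val root = new Path("hdfs:///data/events")
val fs = FileSystem.get(sc.hadoopConfiguration)
val partFiles = fs.globStatus(new Path(root, "*/part-*.parquet"))
                  .map(_.getPath.toString)

// Read only the footers, spread across the executors instead of the driver.
val schemas = sc.parallelize(partFiles.toSeq, 200).map { p =>
  val conf = new Configuration() // executor-side conf; site-specific HDFS
                                 // settings would need to be shipped as well
  ParquetFileReader.readFooter(conf, new Path(p))
    .getFileMetaData.getSchema.toString
}.distinct().collect()
{code}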

The _common_metadata read should resolve the above issues comprehensively, if I 
understood SPARK-8838 correctly. The only thing I didn't get clarification on: 
given that I have a partitioned folder hierarchy, is it sufficient to place the 
_common_metadata file at the top level of the hierarchy and point Spark to load 
at the top level?
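
Concretely, the setup I have in mind is sketched below. The directory layout, paths, and the assumption that Spark will take the schema from the single _common_metadata summary file at the dataset root (rather than from every part file's footer) are exactly what I'm asking to have confirmed.

{code:scala}
// Assumed layout:
//   /data/events/_common_metadata
//   /data/events/date=2015-07-01/part-00000.parquet
//   /data/events/date=2015-07-02/part-00000.parquet
//   ...
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // or the shell's existing sqlContext

// Point Spark at the top of the partitioned hierarchy; partition discovery
// (Spark 1.3+) picks up the date=... directories. The open question is
// whether the schema then comes only from _common_metadata.
val events = sqlContext.parquetFile("/data/events")
events.registerTempTable("events")
{code}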

> spark load of existing parquet files extremely slow if large number of files
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-9347
>                 URL: https://issues.apache.org/jira/browse/SPARK-9347
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Samphel Norden
>
> When the Spark SQL shell is launched and we point it to a folder containing a 
> large number of Parquet files, the sqlContext.parquetFile() command takes a 
> very long time to load the tables. 


