[ https://issues.apache.org/jira/browse/SPARK-9347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647126#comment-14647126 ]
Samphel Norden commented on SPARK-9347:
---------------------------------------

No, I haven't tried the latest. Assuming that the distributed/parallel footer read is what's in the latest:

a) a distributed Spark job would still struggle to read the data, since there are tens of thousands of large Parquet files, and
b) correct me if I'm wrong, but as per my understanding of HDFS, it's not really going to serve only the last block, which contains the Parquet footer; instead the part file will be transferred in its entirety to memory, which is another constraint to deal with.

The _common_metadata read will, and should, resolve the above issues comprehensively, if I understood SPARK-8838 correctly. The only thing I didn't get clarification on: given that I have a partitioned folder hierarchy, is it sufficient to place the _common_metadata file at the top level of the hierarchy and point Spark to load at the top level?

> spark load of existing parquet files extremely slow if large number of files
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-9347
>                 URL: https://issues.apache.org/jira/browse/SPARK-9347
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Samphel Norden
>
> When the Spark SQL shell is launched and we point it to a folder containing a
> large number of Parquet files, the sqlContext.parquetFile() command takes a
> very long time to load the tables.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
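To make the placement question concrete, here is a minimal sketch (pure Python, hypothetical directory and file names) of the kind of partitioned key=value hierarchy Spark writes, with the _common_metadata summary file placed at the top level, i.e. the same root directory one would pass to sqlContext.parquetFile(...). This only illustrates the layout; writing and loading the actual Parquet data is done through Spark, not shown here.

```python
import tempfile
from pathlib import Path

# Build a toy partitioned hierarchy (key=value directories), mimicking
# the layout produced by Spark's partitioned Parquet writes.
# All names here are hypothetical.
root = Path(tempfile.mkdtemp()) / "events"
for year in (2014, 2015):
    for month in (1, 2):
        part_dir = root / f"year={year}" / f"month={month}"
        part_dir.mkdir(parents=True)
        # Stand-in for a real Parquet part file.
        (part_dir / "part-r-00000.parquet").touch()

# The summary file sits at the top of the hierarchy, next to the first
# partition level -- this root is the path you would point Spark at,
# e.g. sqlContext.parquetFile(str(root)).
(root / "_common_metadata").touch()

print(sorted(p.relative_to(root).as_posix()
             for p in root.rglob("*") if p.is_file()))
```

The sketch just makes visible that the summary file lives once at the root rather than being duplicated into every partition directory.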