[ https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro resolved SPARK-34648.
--------------------------------------
    Resolution: Invalid

> Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
> ------------------------------------------------------------------------
>
>                 Key: SPARK-34648
>                 URL: https://issues.apache.org/jira/browse/SPARK-34648
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Pankaj Bhootra
>            Priority: Major
>
> Hello Team
> I am new to Spark, and this question may be a duplicate of the issue 
> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
> We have a large dataset partitioned by calendar date, and within each date 
> partition, we are storing the data as *parquet* files in 128 parts.
> We are trying to run an aggregation on this dataset over 366 dates at a time 
> with Spark SQL on Spark 2.3.0, so our Spark job reads 366*128=46848 
> partitions, all of which are parquet files. There are currently no 
> *_metadata* or *_common_metadata* files available for this dataset.
> The problem we are facing is that when we try to run *spark.read.parquet* on 
> the above 46848 partitions, our data reads are extremely slow. It takes a 
> long time to run even a simple map task (no shuffling) without any 
> aggregation or group by.
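> For context, here is roughly how we invoke the read (a minimal sketch only; 
> the root path and the partition column name *dt* are placeholders, and the 
> real job runs an aggregation rather than the simple select shown):
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.col
> 
> val spark = SparkSession.builder().appName("parquet-read-test").getOrCreate()
> 
> // Read the partitioned dataset root; filtering on the (placeholder) partition
> // column lets Spark prune date directories, but it still has to list and open
> // the 128 part files under each of the 366 selected dates.
> val df = spark.read
>   .parquet("s3://our-bucket/our-dataset")                 // placeholder root path
>   .where(col("dt").between("2020-01-01", "2020-12-31"))   // one year = 366 dates
> 
> // Even a simple map-style action with no shuffle is very slow for us.
> df.select("some_column").show(10)
> {code}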
> I read through the above issue and I think I generally understand the idea 
> behind the *_common_metadata* file. However, that issue was raised against 
> Spark 1.3.1, and for Spark 2.3.0 I have not found any documentation related 
> to this metadata file so far.
> I would like to clarify:
>  # What's the latest best practice for reading a large number of parquet 
> files efficiently?
>  # Does this involve using any additional options with spark.read.parquet? 
> How would that work? (See the sketch after this list for the kind of options 
> I mean.)
>  # Are there other possible reasons for slow data reads, apart from reading 
> the metadata of every part file? We are basically trying to migrate our 
> existing Spark pipeline from CSV files to Parquet, but from my hands-on 
> experience so far, Parquet's read time seems slower than CSV's. This seems to 
> contradict the popular opinion that Parquet performs better in terms of both 
> computation and storage.
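> To illustrate question 2, this is the kind of tuning I am asking about (the 
> option and configuration names below are taken from the Spark documentation; 
> whether any of them actually help with slow file listing and footer reads is 
> exactly what I would like to clarify):
> {code:scala}
> // Skip schema merging across the thousands of part files
> // (this is already the default in 2.x; set explicitly here for illustration).
> spark.conf.set("spark.sql.parquet.mergeSchema", "false")
> 
> // List partition directories with a distributed Spark job once the number of
> // input paths exceeds this threshold, instead of listing them on the driver.
> spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
> 
> // Push filters down into the Parquet reader (on by default).
> spark.conf.set("spark.sql.parquet.filterPushdown", "true")
> 
> val df = spark.read
>   .option("mergeSchema", "false")             // per-read equivalent of the first setting
>   .parquet("s3://our-bucket/our-dataset")     // placeholder root path
> {code}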



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
