Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
Hi Pankaj,

Can you share your detailed code and the Job/Stage info? Which stage is slow?

Pankaj Bhootra wrote on Wed, 10 Mar 2021 at 12:32 PM:

> Hi,
>
> Could someone please revert on this?
>
> Thanks
> Pankaj Bhootra
>
> On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra wrote:
>
>> Hello Team
>>
>> I am new to Spark, and this question may be a duplicate of the issue
>> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
>>
>> We have a large dataset partitioned by calendar date, and within each
>> date partition we store the data as *parquet* files in 128 parts.
>>
>> We are trying to run aggregation on this dataset for 366 dates at a time
>> with Spark SQL on Spark 2.3.0, so our Spark job reads 366*128 = 46848
>> partitions, all of which are parquet files. There are currently no
>> *_metadata* or *_common_metadata* files for this dataset.
>>
>> The problem we are facing is that when we run *spark.read.parquet* on
>> the above 46848 partitions, our data reads are extremely slow. Even a
>> simple map task (no shuffling, no aggregation or group by) takes a long
>> time.
>>
>> I read through the above issue and I think I generally understand the
>> ideas around the *_common_metadata* file. But that issue was raised
>> against Spark 1.3.1, and for Spark 2.3.0 I have not found any
>> documentation on this metadata file so far.
>>
>> I would like to clarify:
>>
>> 1. What is the latest best practice for reading a large number of
>>    parquet files efficiently?
>> 2. Does this involve using any additional options with
>>    spark.read.parquet? How would that work?
>> 3. Are there other possible reasons for slow data reads, apart from
>>    reading the metadata of every part? We are trying to migrate our
>>    existing Spark pipeline from CSV files to parquet, but from my
>>    hands-on experience so far, parquet's read time seems slower than
>>    CSV's. This contradicts the popular opinion that parquet performs
>>    better in terms of both computation and storage.
>>
>> Thanks
>> Pankaj Bhootra
>>
>> -- Forwarded message --
>> From: Takeshi Yamamuro (Jira)
>> Date: Sat, 6 Mar 2021, 20:02
>> Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in
>> Spark Extremely Slow for Large Number of Files?
>>
>> [ https://issues.apache.org/jira/browse/SPARK-34648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296528#comment-17296528 ]
>>
>> Takeshi Yamamuro commented on SPARK-34648:
>>
>> Please use the mailing list (user@spark.apache.org) instead. This is
>> not the right place to ask questions.
>>
>> > Reading Parquet Files in Spark Extremely Slow for Large Number of Files?
>> >
>> > Key: SPARK-34648
>> > URL: https://issues.apache.org/jira/browse/SPARK-34648
>> > Project: Spark
>> > Issue Type: Question
>> > Components: SQL
>> > Affects Versions: 2.3.0
>> > Reporter: Pankaj Bhootra
>> > Priority: Major