On 22 Oct 2016, at 00:48, Chetan Khatri <ckhatriman...@gmail.com> wrote:

Hello Cheng,

Thank you for the response.

I am using Spark 1.6.1, and I am writing around 350 gzip-compressed Parquet
part files for a single table. I processed around 180 GB of data using Spark.

Are you writing to GCS storage or to the local HDD?

Regarding options to set: for performant reads against Parquet data hosted in
an object store, also set

spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
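
As a rough illustration (not part of the original reply), the two options
above could be passed to a job as --conf arguments. The sketch below only
builds the argument list with the Python standard library; the helper name
to_conf_args and the script name your_job.py are hypothetical.

```python
# Settings quoted in the message above.
PARQUET_READ_SETTINGS = {
    "spark.sql.parquet.filterPushdown": "true",
    "spark.sql.parquet.mergeSchema": "false",
}


def to_conf_args(settings):
    """Return a flat list like ['--conf', 'key=value', ...], sorted by key."""
    args = []
    for key, value in sorted(settings.items()):
        args.extend(["--conf", "{}={}".format(key, value)])
    return args


if __name__ == "__main__":
    # your_job.py is a placeholder for the actual application script.
    command = ["spark-submit"] + to_conf_args(PARQUET_READ_SETTINGS) + ["your_job.py"]
    print(" ".join(command))
```

The same key/value pairs could instead go into spark-defaults.conf, one
"key value" pair per line.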

On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

What version of Spark are you using, and how many output files does the job
write out?

By default, Spark versions before 1.6 (not including 1.6 itself) write Parquet
summary files when committing the job. This process reads the footers from all
Parquet files in the destination directory and merges them together. This can
be particularly bad if you are appending a small amount of data to a large
existing Parquet dataset.

If that's the case, you may disable Parquet summary files by setting the
Hadoop configuration property "parquet.enable.summary-metadata" to false.


Now I'm a bit mixed up. Should that be
spark.sql.parquet.enable.summary-metadata=false?
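
For context, a hedged sketch rather than an answer from the thread:
parquet.enable.summary-metadata is described above as a Hadoop configuration
property, and Spark forwards any property prefixed with "spark.hadoop." to the
Hadoop configuration. Assuming that mechanism, the setting could be written
as a spark-defaults.conf line as built below. The helper names
as_spark_property and conf_line are hypothetical.

```python
HADOOP_PREFIX = "spark.hadoop."


def as_spark_property(hadoop_key):
    """Prefix a Hadoop key so Spark copies it into the Hadoop configuration."""
    if hadoop_key.startswith(HADOOP_PREFIX):
        return hadoop_key  # already prefixed, leave unchanged
    return HADOOP_PREFIX + hadoop_key


def conf_line(hadoop_key, value):
    """Format one whitespace-separated spark-defaults.conf entry."""
    return "{} {}".format(as_spark_property(hadoop_key), value)


if __name__ == "__main__":
    print(conf_line("parquet.enable.summary-metadata", "false"))
```

Inside a running shell, the equivalent would be setting the unprefixed key
directly on the SparkContext's Hadoop configuration.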


We've disabled it by default since 1.6.0

Cheng

On 10/21/16 1:47 PM, Chetan Khatri wrote:
Hello Spark Users,

I am writing around 10 GB of processed data to Parquet on a Google Cloud
machine with 1 TB of HDD, 102 GB of RAM, and 16 vCores.

Every time I write to Parquet, the Spark UI shows that the stages succeeded,
but the Spark shell holds the context in a waiting state for almost 10
minutes. Then it clears the broadcast and accumulator shared variables.

Can we speed this up?

Thanks.

--
Yours Aye,
Chetan Khatri.
M.+91 76666 80574
Data Science Researcher
INDIA

Statement of Confidentiality
————————————————————————————
The contents of this e-mail message and any attachments are confidential and 
are intended solely for addressee. The information may also be legally 
privileged. This transmission is sent in trust, for the sole purpose of 
delivery to the intended recipient. If you have received this transmission in 
error, any use, reproduction or dissemination of this transmission is strictly 
prohibited. If you are not the intended recipient, please immediately notify 
the sender by reply e-mail or phone and delete this message and its 
attachments, if any.




