Great! Common sense is very uncommon.

On Fri, Jul 29, 2016 at 8:26 PM, Ewan Leith <ewan.le...@realitymine.com>
wrote:

> If you replace the df.write ….
>
>
>
> With
>
>
>
> df.count()
>
>
>
> in your code you’ll see how much time is taken to process the full
> execution plan without the write output.
>
>
>
> That code below looks perfectly normal for writing a parquet file, yes;
> there shouldn’t be any tuning needed for “normal” performance.
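
A minimal sketch of that count-vs-write comparison, assuming df stands in for the fully transformed dataframe from the thread and hdfspathTemp is the output path already used below:

    var t0 = System.nanoTime()
    df.count()                       // runs the full execution plan, writes nothing
    println(s"plan only:    ${(System.nanoTime() - t0) / 1e9} s")

    t0 = System.nanoTime()
    df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
    println(s"plan + write: ${(System.nanoTime() - t0) / 1e9} s")

Nothing is cached here, so the plan is recomputed for the write and the two timings overlap; the point is only to see how much the write adds on top of the plan itself.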
>
>
>
> Thanks,
>
> Ewan
>
>
>
> *From:* Sumit Khanna [mailto:sumit.kha...@askme.in]
> *Sent:* 29 July 2016 13:41
> *To:* Gourav Sengupta <gourav.sengu...@gmail.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: how to save spark files as parquets efficiently
>
>
>
> Hey Gourav,
>
>
>
> Well, I think it is my execution plan that is at fault. Since df.write is an
> action, the job shown on localhost:4040 will include the time taken for all
> the umpteen transformations before it, right? All I wanted to know is: what
> env/config params, if any, are needed for something simple, namely reading a
> dataframe from parquet and saving it back as another parquet (vanilla
> load/store, no transformation)? Is it good enough to simply read and write
> in the very format mentioned in the Spark tutorial docs, i.e.
>
>
>
> *df.write.format("parquet").mode("overwrite").save(hdfspathTemp)* ??
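
For comparison, a truly vanilla load/store with no transformations in between is just the sketch below, assuming a Spark 2.x SparkSession named spark and a hypothetical input path hdfspathIn. If even this is slow for 500k rows, the bottleneck is on the write/storage side; if it is fast, the time is going into the upstream plan:

    val df = spark.read.parquet(hdfspathIn)
    df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
    // equivalent shorthand: df.write.mode("overwrite").parquet(hdfspathTemp)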
>
>
>
> Thanks,
>
>
>
> On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
>
> The default write format in SPARK is parquet, and I have never faced any
> issues writing over a billion records in SPARK. Are you using
> virtualization by any chance, or an obsolete hard disk, or an Intel Celeron
> maybe?
>
> Regards,
>
> Gourav Sengupta
>
>
>
> On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in>
> wrote:
>
> Hey,
>
>
>
> master=yarn
>
> mode=cluster
>
>
>
> spark.executor.memory=8g
>
> spark.rpc.netty.dispatcher.numThreads=2
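
As a sketch of how those settings are usually wired up: --master yarn and --deploy-mode cluster go on spark-submit, while the memory and RPC settings can be passed there with --conf or set on the session builder, for example:

    import org.apache.spark.sql.SparkSession

    // master / deploy-mode come from spark-submit (--master yarn --deploy-mode cluster);
    // the two configs below could equally be passed there with --conf.
    val spark = SparkSession.builder()
      .appName("parquet-write-poc")                          // hypothetical app name
      .config("spark.executor.memory", "8g")
      .config("spark.rpc.netty.dispatcher.numThreads", "2")
      .getOrCreate()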
>
>
>
> All the POC is on a single-node cluster; the biggest bottleneck being:
>
>
>
> 1.8 hrs to save 500k records as a parquet file/dir, executing this command:
>
>
>
> *df.write.format("parquet").mode("overwrite").save(hdfspathTemp)*
>
>
>
> No doubt, the whole execution plan gets triggered on this write / save
> action. But is it the right command / set of params to save a dataframe?
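
One way to split the two costs in a single run is to materialise the transformed dataframe before writing it; a sketch, assuming df is the fully transformed dataframe and that it fits in the memory/disk cache:

    import org.apache.spark.storage.StorageLevel

    val materialised = df.persist(StorageLevel.MEMORY_AND_DISK)
    materialised.count()        // the upstream transformations are paid for here
    materialised.write.format("parquet").mode("overwrite").save(hdfspathTemp)  // now mostly parquet encoding + HDFS I/O
    materialised.unpersist()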
>
>
>
> Essentially I am doing an upsert by pulling in data from HDFS and then
> updating it with the delta changes of the current run. But I am not sure
> whether the write itself takes that much time or some optimization is needed
> for the upsert. (I have asked that as another question altogether.)
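
For reference, one common dataframe-level upsert pattern (not necessarily what is being done here) is to anti-join the existing data against the delta on the key and union the delta back in; purely a sketch, with a hypothetical key column "id" and dataframes existing and delta:

    // keep existing rows whose key is NOT present in the delta, then append the delta rows
    val unchanged = existing.join(delta, Seq("id"), "left_anti")  // Spark 2.0+; on 1.x use a left outer join plus a null filter
    val upserted  = unchanged.union(delta)                        // unionAll on Spark 1.x; schemas must match
    upserted.write.format("parquet").mode("overwrite").save(hdfspathTemp)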
>
>
>
> Thanks,
>
> Sumit
>
>
>
>
>
>
>
