Hi,

The default write format in Spark is Parquet, and I have never faced any
issues writing over a billion records in Spark. Are you using
virtualization by any chance, or perhaps an obsolete hard disk or an Intel
Celeron?

Regards,
Gourav Sengupta

On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in> wrote:

> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> All the POC is on a single-node cluster. The biggest bottleneck is:
>
> 1.8 hrs to save 500k records as a Parquet file/dir when executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
>
> No doubt the whole execution plan gets triggered on this write/save
> action, but is this the right command / set of params to save a DataFrame?
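>
> One way I could separate the compute time from the write time (just a
> sketch; the persist/count step and the coalesce(4) value are illustrative
> assumptions, not something I have measured):
>
>   // Force the upstream lineage once so the save below measures mostly the
>   // Parquet write itself.
>   df.persist()
>   df.count()
>
>   // coalesce(4) is only an example partition count to control the number
>   // of output files; the write itself is the same command as above.
>   df.coalesce(4)
>     .write
>     .format("parquet")
>     .mode("overwrite")
>     .save(hdfspathTemp)
>
>   df.unpersist()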
>
> Essentially I am doing an upsert by pulling in data from HDFS and then
> updating it with the delta changes of the current run, but I am not sure
> whether the write itself takes that much time or whether the upsert needs
> some optimization. (I have asked that as another question altogether.)
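>
> For context, the upsert is roughly along these lines (a sketch only,
> assuming a Spark 2.x SparkSession; the key column "id", the existingPath
> variable, and the delta DataFrame are hypothetical names, not my exact
> code):
>
>   // current snapshot sitting on HDFS
>   val existing = spark.read.parquet(existingPath)
>
>   // drop the rows superseded by this run's delta, then add the delta back;
>   // union requires matching schemas / column order
>   val upserted = existing
>     .join(delta, Seq("id"), "left_anti")
>     .union(delta)
>
>   upserted.write.format("parquet").mode("overwrite").save(hdfspathTemp)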
>
> Thanks,
> Sumit
>
>
