Hey Gourav,

I think it is my execution plan that is at fault. Since df.write is an
action, the job shown on localhost:4040 will include the time taken by all
the umpteen transformations upstream of it, right? What I wanted to know is:
what env/config params are appropriate for something simple like reading a
DataFrame from parquet and saving it back as another parquet (meaning a
vanilla load/store with no transformations)? Is it good enough to simply
call read and write in the form shown in the Spark tutorial docs, i.e.

df.write.format("parquet").mode("overwrite").save(hdfspathTemp) ??
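
In case it helps, this is the kind of minimal load/store I have in mind (a
rough sketch only, assuming the Spark 2.x SparkSession API and placeholder
HDFS paths; on 1.x the equivalent sqlContext.read / df.write calls would
apply):

import org.apache.spark.sql.SparkSession

// rough sketch: vanilla parquet copy, no transformations in between
// the app name and both paths below are placeholders, not my actual job
object ParquetCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("vanilla-parquet-copy")
      .getOrCreate()

    val hdfsPathIn   = "hdfs:///tmp/source_parquet"   // placeholder input path
    val hdfsPathTemp = "hdfs:///tmp/target_parquet"   // placeholder output path

    // plain load/store: read the parquet data and write it straight back out
    val df = spark.read.parquet(hdfsPathIn)
    df.write.format("parquet").mode("overwrite").save(hdfsPathTemp)

    spark.stop()
  }
}

My thinking is that if this plain copy is fast, then the 1.8 hrs must be
coming from the upsert transformations rather than the write itself. I could
also materialize the upserted DataFrame first (df.cache() followed by
df.count()) and then time only the save, to separate the two.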

Thanks,

On Fri, Jul 29, 2016 at 4:22 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> The default write format in SPARK is parquet, and I have never faced any
> issues writing over a billion records in SPARK. Are you using
> virtualization by any chance, or an obsolete hard disk, or an Intel Celeron
> maybe?
>
> Regards,
> Gourav Sengupta
>
> On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in>
> wrote:
>
>> Hey,
>>
>> master=yarn
>> mode=cluster
>>
>> spark.executor.memory=8g
>> spark.rpc.netty.dispatcher.numThreads=2
>>
>> All of this is a POC on a single-node cluster. The biggest bottleneck:
>>
>> 1.8 hrs to save 500k records as a parquet file/dir, executing this command:
>>
>> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>>
>>
>> No doubt the whole execution plan gets triggered on this write/save
>> action. But is it the right command / set of params to save a DataFrame?
>>
>> Essentially I am doing an upsert by pulling in data from HDFS and then
>> updating it with the delta changes of the current run. But I am not sure
>> whether the write itself takes that much time or whether the upsert needs
>> some optimization. (I have asked that as a separate question.)
>>
>> Thanks,
>> Sumit
>>
>>
>
