How are you writing it out? Can you post some code?
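For example, is it roughly something like the below? This is just my guess at your setup from the description -- the paths, the app name, and the way the DataFrame gets built are placeholders, not anything from your mail.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-write-test")
sqlContext = SQLContext(sc)

# build and cache the ~100 GB, 14-column DataFrame
# (the input path and format here are placeholders)
df = sqlContext.read.json("hdfs:///path/to/input")
df.cache()
df.count()  # materialize the cache

# the json write you say is reasonably fast
df.write.mode("overwrite").json("hdfs:///path/to/out_json")

# the parquet write that ends up taking over half an hour
df.write.mode("overwrite").parquet("hdfs:///path/to/out_parquet")

If what you actually run differs from that (extra options on the writer, a repartition before the write, etc.), posting the exact code would help pin it down.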

Regards
Sab
On 14-Nov-2015 5:21 am, "Rok Roskar" <rokros...@gmail.com> wrote:

> I'm not sure what you mean? I didn't do anything specifically to partition
> the columns
> On Nov 14, 2015 00:38, "Davies Liu" <dav...@databricks.com> wrote:
>
>> Do you have partitioned columns?
>>
>> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar <rokros...@gmail.com> wrote:
>> > I'm writing a ~100 GB pyspark DataFrame with a few hundred partitions
>> > into a parquet file on HDFS. I've got a few hundred nodes in the
>> > cluster, so for the size of file this is way over-provisioned (I've
>> > tried it with fewer partitions and fewer nodes, no obvious effect). I
>> > was expecting the dump to disk to be very fast -- the DataFrame is
>> > cached in memory and contains just 14 columns (13 are floats and one
>> > is a string). When I write it out in json format, this is indeed
>> > reasonably fast (though it still takes a few minutes, which is longer
>> > than I would expect).
>> >
>> > However, when I try to write a parquet file it takes way longer -- the
>> > first set of tasks finishes in a few minutes, but the subsequent tasks
>> > take more than twice as long or longer. In the end it takes over half
>> > an hour to write the file. I've looked at the disk I/O and CPU usage
>> > on the compute nodes, and it looks like the processors are fully
>> > loaded while the disk I/O is essentially zero for long periods of
>> > time. I don't see any obvious garbage collection issues and there are
>> > no problems with memory.
>> >
>> > Any ideas on how to debug/fix this?
>> >
>> > Thanks!
>> >
>>
>
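Also, on Davies' question: by "partitioned columns" I think he means whether you write with something like the line below, which lays the output out as one directory per column value (just illustrating the API -- the column name and path are made up):

  df.write.partitionBy("some_column").parquet("hdfs:///path/to/out_parquet")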
