I'm not sure what you mean? I didn't do anything specifically to partition the columns.

On Nov 14, 2015 00:38, "Davies Liu" <dav...@databricks.com> wrote:
> Do you have partitioned columns?
>
> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar <rokros...@gmail.com> wrote:
> > I'm writing a ~100 GB pyspark DataFrame with a few hundred partitions into
> > a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> > the size of file this is way over-provisioned (I've tried it with fewer
> > partitions and fewer nodes, no obvious effect). I was expecting the dump to
> > disk to be very fast -- the DataFrame is cached in memory and contains just
> > 14 columns (13 are floats and one is a string). When I write it out in JSON
> > format, this is indeed reasonably fast (though it still takes a few minutes,
> > which is longer than I would expect).
> >
> > However, when I try to write a parquet file it takes way longer -- the first
> > set of tasks finishes in a few minutes, but the subsequent tasks take more
> > than twice as long or longer. In the end it takes over half an hour to write
> > the file. I've looked at the disk I/O and CPU usage on the compute nodes, and
> > it looks like the processors are fully loaded while the disk I/O is
> > essentially zero for long periods of time. I don't see any obvious garbage
> > collection issues and there are no problems with memory.
> >
> > Any ideas on how to debug/fix this?
> >
> > Thanks!
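For anyone reading along, this is roughly the distinction Davies is asking about. Below is a minimal PySpark sketch (the HDFS paths, the JSON source, and the column name are placeholders I made up, not taken from the thread): a plain df.write.parquet() produces one part-file per DataFrame partition, while adding partitionBy() on a column makes each task split its output across every distinct value of that column, which can mean many small files and noticeably more per-task work. Parquet also does columnar encoding and compression, so it is more CPU-heavy than a JSON dump even without partitioned columns.

```python
from pyspark.sql import SparkSession

# Minimal sketch; paths, source format, and column names are hypothetical.
spark = SparkSession.builder.appName("parquet-write-check").getOrCreate()

df = spark.read.json("hdfs:///user/rok/input.json")  # placeholder input
df = df.cache()

# How many partitions will become parquet part-files?
print(df.rdd.getNumPartitions())

# Plain write: one part-file per partition, no partitioned columns.
df.write.mode("overwrite").parquet("hdfs:///user/rok/output_plain.parquet")

# Write with partitionBy(): one sub-directory per distinct value of the
# column, so a high-cardinality column can produce many small files and
# make each write task much slower.
df.write.mode("overwrite") \
    .partitionBy("some_string_column") \
    .parquet("hdfs:///user/rok/output_partitioned.parquet")
```

Checking whether the job is doing a partitionBy() write (and what the number of partitions and distinct values looks like) would help narrow down where the extra time in the parquet write is going.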