Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Fengdong Yu
How many partitions did you generate? If millions were generated, then there is huge memory consumption. > On Oct 26, 2015, at 10:58 AM, Jerry Lam wrote: > > Hi guys, > > I mentioned that the partitions are generated so I tried to read the > partition data from it. The driver
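
A quick way to estimate the partition count before writing is to count the distinct values of the partitionBy column. A minimal sketch, assuming `df` is the DataFrame about to be saved and "id" is a hypothetical partition column:

    // Assumes df is the DataFrame about to be written; "id" is hypothetical.
    val nPartitions = df.select("id").distinct().count()
    println(s"partitionBy(\"id\") would create ~$nPartitions directories")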

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Jerry Lam
Hi Fengdong, Why does it need more memory on the driver side when there are many partitions? It seems the implementation can only support use cases with a dozen partitions; when it is over 100, it falls apart. It is also quite slow to initialize the loading of partition tables when the number of

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Koert Kuipers
It seems HadoopFsRelation keeps track of all part files (instead of just the data directories). I believe this has something to do with Parquet footers, but I didn't look into it further. The result is that on the driver side it: 1) tries to keep track of all part files in a Map[Path,
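
To make the distinction concrete, a rough sketch (illustration only, not Spark's actual code) of why a per-file cache grows so much faster than a per-directory cache:

    import org.apache.hadoop.fs.{FileStatus, Path}
    import scala.collection.mutable

    // Illustration only, not Spark's actual code: a cache keyed by part
    // file grows with every file written, while a cache keyed by partition
    // directory grows only with the number of partitions.
    val perFileCache = mutable.Map[Path, FileStatus]()  // ~1M+ entries for this job
    val perDirCache  = mutable.Map[Path, Seq[Path]]()   // one entry per partition directory

    // At even a few hundred bytes of metadata per entry, a million-entry
    // map costs hundreds of MB on the driver before Parquet footers are counted.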

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Josh Rosen
Hi Jerry, Do you have speculation enabled? A write which produces one million files / output partitions might be using tons of driver memory via the OutputCommitCoordinator's bookkeeping data structures. On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam wrote: > Hi spark guys, >
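
For reference, speculation is off by default; a minimal sketch of ruling it out explicitly (app name hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // Speculative attempts add entries to the OutputCommitCoordinator's
    // driver-side bookkeeping, so it is worth ruling out explicitly.
    val conf = new SparkConf()
      .setAppName("partitioned-write")     // hypothetical
      .set("spark.speculation", "false")   // "false" is the default
    val sc = new SparkContext(conf)
    println(sc.getConf.get("spark.speculation", "false"))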

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi Josh, No, I don't have speculation enabled. The driver ran for a few hours until it hit OOM. Interestingly, all partitions were generated successfully (the _SUCCESS file is written in the output directory). Is there a reason why the driver needs so much memory? The jstack revealed that it called

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi guys, I mentioned that the partitions are generated, so I tried to read the partition data from them. The driver hits OOM after a few minutes. The stack trace is below. It looks very similar to the jstack above (note the refresh method). Thanks! Name: java.lang.OutOfMemoryError Message: GC
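
One workaround at this point is to read a single partition directory directly instead of the table root, so the driver never has to discover and track every partition. A sketch, assuming a spark-shell sqlContext and hypothetical paths:

    // Reading one partition directory avoids full partition discovery
    // (the partition column itself is not inferred this way).
    val onePartition = sqlContext.read.parquet("s3://bucket/table/date=2015-10-25")

    // Reading the table root instead forces discovery (and the refresh
    // seen in the jstack) over every partition directory:
    // val whole = sqlContext.read.parquet("s3://bucket/table")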

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi spark guys, I think I hit the same issue as SPARK-8890 https://issues.apache.org/jira/browse/SPARK-8890. It is marked as resolved; however, it is not. I have over a million output directories from a single column in partitionBy. Not sure if this is a regression? Do I need to set some

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi guys, After waiting for a day, it actually caused an OOM on the Spark driver. I configured the driver with 6GB. Note that I didn't call refresh myself; the method was called when saving the dataframe in Parquet format. Also, I'm using partitionBy() on the DataFrameWriter to generate over 1
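
For context, a self-contained sketch of the write pattern being described (column names, row count, and paths are all hypothetical), submitted with something like spark-submit --driver-memory 6g:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.lit

    val sc = new SparkContext(new SparkConf().setAppName("partitioned-write"))
    val sqlContext = new SQLContext(sc)

    // ~1M distinct values in the partition column => ~1M output directories.
    val df = sqlContext.range(1000000L).withColumn("value", lit(1))

    df.write
      .partitionBy("id")               // high-cardinality partition column
      .parquet("s3://bucket/output")   // hypothetical output path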