How many partitions did you generate?
If millions were generated, then a huge amount of memory is consumed.
> On Oct 26, 2015, at 10:58 AM, Jerry Lam wrote:
>
> Hi guys,
>
> I mentioned that the partitions are generated so I tried to read the
> partition data from it. The driver
Hi Fengdong,
Why does it need more memory on the driver side when there are many partitions? It
seems the implementation can only support use cases with a dozen partitions;
once there are over 100, it falls apart. It is also quite slow to initialize the
loading of partition tables when the number of partitions grows large.
It seems HadoopFsRelation keeps track of all part files (instead of just
the data directories). I believe this has something to do with Parquet
footers, but I didn't look into it further. The result is that on the
driver side it:
1) tries to keep track of all part files in a Map[Path,
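A back-of-envelope sketch of why tracking every part file on the driver hurts: the ~1 KB per-entry cost below is an assumption for illustration (Path string, file status fields, hash-map overhead), not a measured Spark figure.

```scala
// Rough, hypothetical estimate of driver-side bookkeeping when every
// part file (rather than each partition directory) is tracked in a map.
object PartFileMemoryEstimate {
  // Assumed ~1 KB per tracked entry; real per-entry cost varies.
  val bytesPerEntry = 1024L

  def estimateBytes(partitions: Long, filesPerPartition: Long): Long =
    partitions * filesPerPartition * bytesPerEntry

  def main(args: Array[String]): Unit = {
    val bytes = estimateBytes(1000000L, 1L)
    // One million single-file partitions already approaches a gigabyte
    // of bookkeeping under this assumption.
    println(f"~${bytes / (1024.0 * 1024 * 1024)}%.2f GB for 1M part files")
  }
}
```

Under these assumptions, a million part files alone would eat a large slice of a 6GB driver heap before any query work begins.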
Hi Jerry,
Do you have speculation enabled? A write which produces one million files /
output partitions might be using tons of driver memory via the
OutputCommitCoordinator's bookkeeping data structures.
On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam wrote:
> Hi spark guys,
>
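For reference, a minimal sketch (setting names standard, app name hypothetical) of pinning speculation off for such a write; `spark.speculation` defaults to false, but it may be overridden in spark-defaults.conf:

```scala
import org.apache.spark.SparkConf

// Sketch only: explicitly disable speculative execution so the
// OutputCommitCoordinator does not have to arbitrate duplicate task
// attempts for each of the (potentially millions of) output partitions.
val conf = new SparkConf()
  .setAppName("partitioned-write")   // hypothetical app name
  .set("spark.speculation", "false") // the default, pinned explicitly
```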
Hi Josh,
No, I don't have speculation enabled. The driver ran for a few hours until
it was OOM. Interestingly, all partitions were generated successfully
(the _SUCCESS file is written in the output directory). Is there a reason why
the driver needs so much memory? The jstack revealed that it called
Hi guys,
I mentioned that the partitions are generated, so I tried to read the
partition data back. The driver is OOM after a few minutes. The stack
trace is below. It looks very similar to the jstack above (note the
refresh method). Thanks!
Name: java.lang.OutOfMemoryError
Message: GC
Hi spark guys,
I think I hit the same issue as SPARK-8890
(https://issues.apache.org/jira/browse/SPARK-8890). It is marked as resolved;
however, it is not. I have over a million output directories for a single
column in partitionBy. Not sure if this is a regression? Do I need to
set some
Hi guys,
After waiting for a day, it actually causes an OOM on the Spark driver. I
configured the driver to have 6GB. Note that I didn't call refresh myself;
the method was called when saving the DataFrame in Parquet format. Also, I'm
using partitionBy() on the DataFrameWriter to generate over 1 million
partitions.
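The write pattern being described is essentially the following (column and path names are hypothetical); each distinct value of the partition column becomes its own output directory, so a high-cardinality column yields over a million directories:

```scala
// Sketch of the problematic write: partitioning by a column with
// ~1M distinct values produces ~1M output directories.
df.write
  .partitionBy("eventId")         // hypothetical high-cardinality column
  .parquet("/data/events_by_id")  // hypothetical output path
```

Partitioning by a lower-cardinality column (e.g. a date) keeps the directory count, and the driver-side metadata, manageable.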