Re: sc.parallelize(512k items) doesn't always use 64 executors

2015-07-30 Thread Konstantinos Kougios
yes, thanks, that sorted out the issue. On 30/07/15 09:26, Akhil Das wrote: sc.parallelize takes a second parameter which is the total number of partitions; are you using that? Thanks Best Regards On Wed, Jul 29, 2015 at 9:27 PM, Kostas Kougios kostas.koug...@googlemail.com
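
For reference, a minimal spark-shell sketch of the fix described above, assuming the goal is 64 partitions so all 64 executors get work (the item count and partition number are illustrative):

    val items = 0 until 512000              // stand-in for the real 512k items
    val rdd = sc.parallelize(items, 64)     // second argument = number of partitions
    println(rdd.partitions.length)          // should print 64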

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Konstantinos Kougios
yes, YARN was terminating the executor because the off-heap memory limit was exceeded. On 13/07/15 06:55, Ruslan Dautkhanov wrote: the executor receives a SIGTERM (from whom???) From YARN Resource Manager. Check if yarn fair scheduler preemption and/or speculative execution are turned on,
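
A hedged sketch of the checks suggested here: spark.speculation is the standard Spark property, while YARN fair-scheduler preemption is a cluster-side setting that cannot be changed from inside the application (app name and values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("xml-job")                 // illustrative app name
      .set("spark.speculation", "false")     // rule out speculative-execution kills
    // Fair-scheduler preemption (yarn.scheduler.fair.preemption) lives in the
    // ResourceManager's yarn-site.xml / fair-scheduler.xml, so check it there.
    val sc = new SparkContext(conf)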

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Konstantinos Kougios
it was the memoryOverhead. It runs OK with more of that, but do you know which libraries could affect this? I find it strange that it needs 4g for a task that processes some xml files. The tasks themselves require less Xmx. Cheers On 13/07/15 06:29, Jong Wook Kim wrote: Based on my
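
A minimal sketch of raising the overhead on Spark 1.x under YARN; the figures are illustrative, not a recommendation:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")                   // heap (Xmx) for each executor
      .set("spark.yarn.executor.memoryOverhead", "1024")    // off-heap allowance in MB;
                                                            // the default is a small fraction of executor memory (384 MB minimum)
    // equivalent at submit time:
    //   spark-submit --conf spark.yarn.executor.memoryOverhead=1024 ...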

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Konstantinos Kougios
of memory. e.g. the billion laughs xml: https://en.wikipedia.org/wiki/Billion_laughs -Ewan On 13/07/15 10:11, Konstantinos Kougios wrote: it was the memoryOverhead. It runs ok with more of that, but do you know which libraries could affect this? I find it strange that it needs 4g for a task
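
A sketch of one way to guard against entity-expansion bombs such as billion laughs when parsing each file inside a task; StAX is shown here as an assumption, the job's actual parser may differ:

    import javax.xml.stream.XMLInputFactory
    import java.io.StringReader

    val xml = "<root/>"                       // placeholder for one file's content
    val factory = XMLInputFactory.newInstance()
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, java.lang.Boolean.FALSE)                     // reject DTDs outright
    factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, java.lang.Boolean.FALSE) // no external entities
    val reader = factory.createXMLStreamReader(new StringReader(xml))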

Re: is it possible to disable -XX:OnOutOfMemoryError=kill %p for the executors?

2015-07-08 Thread Konstantinos Kougios
seems you're correct: 2015-07-07 17:21:27,245 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=38506,containerID=container_1436262805092_0022_01_03] is running beyond virtual memory limits. Current usage: 4.3 GB of 4.5 GB
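
Rough arithmetic that could produce a container limit like the 4.5 GB in the log above; the settings are assumptions for illustration, the real job's values may differ:

    val executorMemoryMb = 4096                          // e.g. --executor-memory 4g   (assumption)
    val overheadMb       = 512                           // e.g. spark.yarn.executor.memoryOverhead=512 (assumption)
    val containerLimitMb = executorMemoryMb + overheadMb // 4608 MB, roughly the 4.5 GB limit YARN enforces
    // Once the container crosses that limit, YARN itself kills it, so the JVM's
    // -XX:OnOutOfMemoryError=kill %p handler is not the thing doing the killing.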

binaryFiles() for 1 million files, too much memory required

2015-07-01 Thread Konstantinos Kougios
Once again I am trying to read a directory tree using binaryFiles(). My directory tree has a root dir ROOTDIR and subdirs where the files are located, i.e. ROOTDIR/1 ROOTDIR/2 ROOTDIR/.. ROOTDIR/100. A total of 1 million files split into 100 subdirs. Using binaryFiles requires too much memory on
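
A sketch of one possible workaround given the layout described above: list one subdirectory per binaryFiles() call so the driver never holds metadata for all 1 million files at once. The per-file work and the OUTDIR output path are placeholders:

    (1 to 100).foreach { i =>
      val files = sc.binaryFiles(s"hdfs:///ROOTDIR/$i")                           // one subdirectory at a time
      val sizes = files.map { case (path, data) => (path, data.toArray().length) } // placeholder per-file work
      sizes.saveAsTextFile(s"hdfs:///OUTDIR/$i")                                  // write each batch instead of collecting
    }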

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Hi Marchelo, the data are collected in, say, class C. c.id is the id of each of those data items. But that id might appear more than once in those 1 mil xml files, so I am doing a reduceByKey(). Even if I had multiple binaryFiles RDDs, wouldn't I have to ++ those in order to correctly
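
A sketch of the shape of job being discussed, with hypothetical names (C, parseXml, ROOTDIR) standing in for the real ones; union plays the role of the "++" step and reduceByKey merges records sharing an id:

    import org.apache.spark.input.PortableDataStream

    case class C(id: String, count: Long)
    def parseXml(data: PortableDataStream): C = C("some-id", 1L)   // hypothetical stand-in for the real parser

    val perDir  = (1 to 100).map(i => sc.binaryFiles(s"hdfs:///ROOTDIR/$i"))
    val allData = sc.union(perDir)                                 // the "++" / union step
    val merged  = allData
      .map { case (_, data) => parseXml(data) }
      .map(c => (c.id, c))
      .reduceByKey((a, b) => C(a.id, a.count + b.count))           // merge duplicates of the same id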

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Now I am profiling the executor. There seems to be a memory leak. 20 mins after the run there were: 157k byte[] allocated for 75MB, 519k java.lang.ref.Finalizer for 31MB, 291k java.util.zip.Inflater for 17MB, 487k java.util.zip.ZStreamRef for 11MB. An hour after the run I got: 186k byte[]
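
The Finalizer / Inflater / ZStreamRef growth above is the typical signature of compressed streams being left for the finalizer to clean up, since java.util.zip classes historically rely on finalization to release native memory. A sketch of reading a PortableDataStream with an explicit close (readOneFile is a hypothetical helper):

    import org.apache.spark.input.PortableDataStream

    def readOneFile(data: PortableDataStream): Array[Byte] = {
      val in = data.open()                    // DataInputStream over the file's bytes
      try {
        val out = new java.io.ByteArrayOutputStream()
        val buf = new Array[Byte](8192)
        var n = in.read(buf)
        while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
        out.toByteArray
      } finally {
        in.close()                            // release it now instead of waiting for the finalizer
      }
    }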

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
:01, Konstantinos Kougios wrote: Now I am profiling the executor. There seems to be a memory leak. 20 mins after the run there were: 157k byte[] allocated for 75MB, 519k java.lang.ref.Finalizer for 31MB, 291k java.util.zip.Inflater for 17MB, 487k java.util.zip.ZStreamRef for 11MB. An hour after

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
After 2h of running, now I got a 10GB long[], 1.3 mil instances of long[]. So probably information about the files again.

Re: spark times out maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
Thanks, did that and now I am getting an out-of-memory error, but I am not sure where it occurs. It can't be on the Spark executor, as I have 28GB allocated to it. It is not the driver, because I run this locally and monitor it via jvisualvm. Unfortunately I can't jmx-monitor hadoop. From the

Re: spark times out maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
No luck, I am afraid. After giving the namenode 16GB of RAM, I am still getting an out-of-memory exception, though a somewhat different one: 15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw exception: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded at
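
The "User class threw exception" line is logged by the YARN ApplicationMaster, so if the job runs in yarn-cluster mode this is the driver heap (not the namenode) hitting the GC overhead limit. A small check one can run inside the driver, with the submit-time knob noted in a comment (the 8g figure is illustrative):

    // confirm how much heap the driver/AM actually received
    val driverMaxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
    println(s"driver max heap: $driverMaxHeapMb MB")
    // raising it has to happen at submit time, e.g.
    //   spark-submit --master yarn-cluster --driver-memory 8g ...
    // (setting spark.driver.memory inside the job is too late: the driver JVM is already running)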

Re: spark times out maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
:///path/to/files/*).count() in the spark-shell and verify that part works? Ewan -Original Message- From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com] Sent: 08 June 2015 15:40 To: Ewan Leith; user@spark.apache.org Subject: Re: spark times out maybe due to binaryFiles
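
For reference, the suggestion quoted above written out as a complete spark-shell line; the scheme and path are placeholders, since the original message is truncated before them:

    val n = sc.binaryFiles("hdfs:///path/to/files/*").count()
    println(s"binaryFiles listed and counted $n files")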