After submitting the job, if you do a ps aux | grep spark-submit you can
see all the JVM params. Are you using the high-level consumer (receiver
based) for receiving data from Kafka? In that case, if your throughput is
high and the processing delay exceeds the batch interval, you will hit
these memory issues, because data keeps being received and dumped into
memory. You can set the StorageLevel to MEMORY_AND_DISK (but it slows
things down). Another alternative would be to use the low-level Kafka
consumer <https://github.com/dibbhatt/kafka-spark-consumer> or the
non-receiver based directStream
<https://spark.apache.org/docs/1.3.1/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers>
that ships with Spark.
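
For illustration, here is a rough, untested Scala sketch of both options
(the topic name, consumer group, ZooKeeper quorum and broker list are
placeholders for your own settings):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Option 1: receiver-based (high-level consumer), spilling received
    // blocks to disk instead of keeping everything in memory.
    val receiverStream = KafkaUtils.createStream(
      ssc,
      "zkhost:2181",          // placeholder ZooKeeper quorum
      "my-consumer-group",    // placeholder consumer group
      Map("mytopic" -> 1),    // placeholder topic -> receiver threads
      StorageLevel.MEMORY_AND_DISK)

    // Option 2: direct (no receiver); Spark pulls from Kafka only when a
    // batch is actually processed, so data is not buffered ahead of time.
    val directStream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "broker1:9092"),  // placeholder brokers
      Set("mytopic"))

    // In practice you would keep only one of the two streams.
    directStream.map(_._2).filter(_.nonEmpty).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

If the intake rate itself is the problem, you can also cap it with
spark.streaming.receiver.maxRate for the receiver-based stream, or
spark.streaming.kafka.maxRatePerPartition for the direct stream.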

Thanks
Best Regards

On Wed, May 27, 2015 at 11:51 AM, Ji ZHANG <zhangj...@gmail.com> wrote:

> Hi,
>
> I'm using Spark Streaming 1.3 on CDH5.1 in yarn-cluster mode. I found that
> YARN is killing the driver and executor processes because of excessive
> memory use. Here's what I tried:
>
> 1. Xmx is set to 512M and the GC looks fine (one ygc per 10s), so the
> extra memory is not used by the heap.
> 2. I set the two memoryOverhead params to 1024 (default is 384), but the
> memory just keeps growing and then hits the limit.
> 3. The problem does not show up in low-throughput jobs, nor in standalone
> mode.
> 4. The test job just receives messages from Kafka with a batch interval of
> 1 second, does some filtering and aggregation, and then prints to the
> executor logs, so it's not some 3rd-party library causing the 'leak'.
>
> I built Spark 1.3 myself, with the correct Hadoop versions.
>
> Any ideas will be appreciated.
>
> Thanks.
>
> --
> Jerry
>
