Increasing the number of partitions on the data file solved the problem.
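
For anyone finding this thread later, here is a rough plain-Java model of why the partition count can matter. It is not Spark code and not the code from this thread. It assumes each partition becomes one task that runs the "read, then process" loop over its own keys sequentially, so the number of loops running at once is at most min(partitions, cores). The names PartitionSketch, split, read, process and run are hypothetical, and read/process are trivial stand-ins for the Cassandra read and the text processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartitionSketch {

    // Split the keys into numPartitions contiguous slices of near-equal size.
    static <T> List<List<T>> split(List<T> keys, int numPartitions) {
        List<List<T>> slices = new ArrayList<>();
        int n = keys.size();
        for (int i = 0; i < numPartitions; i++) {
            int from = (int) ((long) i * n / numPartitions);
            int to = (int) ((long) (i + 1) * n / numPartitions);
            slices.add(new ArrayList<>(keys.subList(from, to)));
        }
        return slices;
    }

    // Upper bound on tasks running at the same time (assumption of this model).
    static int maxConcurrentTasks(int numPartitions, int cores) {
        return Math.min(numPartitions, cores);
    }

    // Hypothetical stand-ins for the Cassandra read and the text processing.
    static String read(String key) { return "text-for-" + key; }
    static int process(String text) { return text.length(); }

    // One task per slice; each task loops over its keys one at a time.
    static List<Integer> run(List<String> keys, int numPartitions, int cores) {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (List<String> slice : split(keys, numPartitions)) {
                futures.add(pool.submit(() -> {
                    List<Integer> out = new ArrayList<>();
                    for (String k : slice) {
                        out.add(process(read(k)));
                    }
                    return out;
                }));
            }
            // Collect results in partition order.
            List<Integer> results = new ArrayList<>();
            for (Future<List<Integer>> f : futures) {
                results.addAll(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            keys.add("k" + i);
        }
        // With 2 partitions only 2 of the 8 cores can be busy.
        System.out.println(maxConcurrentTasks(2, 8));
        System.out.println(maxConcurrentTasks(16, 8));
        System.out.println(run(keys, 16, 8).size());
    }
}
```

In Spark itself the partition count can be raised with the minPartitions argument of JavaSparkContext.textFile or with JavaRDD.repartition. The thread does not say which of these was used here.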

On 6 June 2014 18:46, Oleg Proudnikov <> wrote:

> An additional observation: map and mapValues are pipelined and executed,
> as expected, in pairs. This means there is a simple sequence of steps for
> each value of K: first the read from Cassandra, then the processing.
> This is exactly how a normal Java loop with these two steps inside would
> behave. I understand that this avoids loading everything in a batch first
> and piling up massive text arrays.
> Also the keys are relatively evenly distributed across Executors.
> The question is - why is this still so slow? I would appreciate any
> suggestions on where to focus my search.
> Thank you,
> Oleg
> On 6 June 2014 16:24, Oleg Proudnikov <> wrote:
>> Hi All,
>> I am passing Java static methods into RDD transformations map and
>> mapValues. The first map is from a simple string K into a (K,V) where V is
>> a Java ArrayList of large text strings, 50K each, read from Cassandra.
>> MapValues does processing of these text blocks into very small ArrayLists.
>> The code runs quite slowly compared to running it in parallel on the same
>> servers from plain Java.
>> I gave the same heap to the Executors and to Java. Does Java run slower
>> under Spark, do I suffer from excess heap pressure, or am I missing something?
>> Thank you for any insight,
>> Oleg
> --
> Kind regards,
> Oleg

Kind regards,

