Increasing the number of partitions on the data file solved the problem.
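
For anyone hitting the same symptom, here is a rough sketch of why the partition count matters. Spark runs one task per partition, so an RDD with only a couple of partitions keeps only a couple of cores busy no matter how evenly the keys are spread. In the Java API the count can be requested at read time with the two-argument JavaSparkContext.textFile(path, minPartitions), or changed afterwards with JavaRDD.repartition(numPartitions). The class below is not Spark code. It is a hypothetical plain-Java model, and the names PartitionSketch, split and maxConcurrentTasks, the 1000 keys and the 16 cores are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Spark code: shows how the partition count caps parallelism.
class PartitionSketch {

    // Split keys into numPartitions contiguous chunks whose sizes differ by at most one.
    static List<List<String>> split(List<String> keys, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        int n = keys.size();
        for (int i = 0; i < numPartitions; i++) {
            int from = (int) ((long) i * n / numPartitions);
            int to = (int) ((long) (i + 1) * n / numPartitions);
            parts.add(new ArrayList<>(keys.subList(from, to)));
        }
        return parts;
    }

    // One task per partition and one core per task, so this many tasks can run at once.
    static int maxConcurrentTasks(int numPartitions, int totalCores) {
        return Math.min(numPartitions, totalCores);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("key-" + i);
        }
        int totalCores = 16; // assumed cluster size, for illustration only
        for (int numPartitions : new int[] {2, 64}) {
            List<List<String>> parts = split(keys, numPartitions);
            System.out.println(numPartitions + " partitions, "
                    + parts.get(0).size() + " keys in the first, "
                    + maxConcurrentTasks(numPartitions, totalCores)
                    + " of " + totalCores + " cores busy");
        }
    }
}
```

With few partitions each task also has to work through a long run of keys sequentially, which would match the "plain Java loop" behaviour described in the quoted message below.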

On 6 June 2014 18:46, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:

> An additional observation: map and mapValues are pipelined and executed,
> as expected, in pairs. For each value of K there is a simple sequence of
> two steps: first the read from Cassandra, then the processing. This is
> exactly how a normal Java loop with these two steps inside would behave.
> As I understand it, this avoids batch-loading everything first and piling
> up massive text arrays.
>
> Also the keys are relatively evenly distributed across Executors.
>
> The question is - why is this still so slow? I would appreciate any
> suggestions on where to focus my search.
>
> Thank you,
> Oleg
>
>
>
> On 6 June 2014 16:24, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am passing Java static methods into RDD transformations map and
>> mapValues. The first map is from a simple string K into a (K,V) where V is
>> a Java ArrayList of large text strings, 50K each, read from Cassandra.
>> mapValues then processes these text blocks into very small ArrayLists.
>>
>> The code runs quite slowly compared to running it in parallel on the same
>> servers from plain Java.
>>
>> I gave the same heap to Executors and Java. Does Java run slower under
>> Spark, do I suffer from excess heap pressure, or am I missing something?
>>
>> Thank you for any insight,
>> Oleg
>>
>>
>
>
> --
> Kind regards,
>
> Oleg
>
>


-- 
Kind regards,

Oleg
