Re: Using Java functions in Spark

2014-06-07 Thread Oleg Proudnikov
Increasing the number of partitions of the input data file solved the problem.
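
A minimal sketch of what that fix can look like in Spark's Java API; the
path and partition count are illustrative, and sc stands for the
application's JavaSparkContext:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Ask for more partitions up front when reading the file...
    JavaRDD<String> keys = sc.textFile("hdfs://.../keys.txt", 64);
    // ...or rebalance an existing RDD (this triggers a shuffle).
    JavaRDD<String> rebalanced = keys.repartition(64);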


On 6 June 2014 18:46, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 An additional observation: the map and mapValues are pipelined and
 executed, as expected, in pairs. This means there is a simple sequence of
 steps for each value of K: first the read from Cassandra, then the
 processing. This is exactly the behaviour of a plain Java loop with these
 two steps inside. I understand that this avoids a bulk load up front and
 the pile-up of massive text arrays.

 Also, the keys are distributed relatively evenly across the Executors.

 The question is: why is this still so slow? I would appreciate any
 suggestions on where to focus my search.

 Thank you,
 Oleg



 On 6 June 2014 16:24, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 Hi All,

 I am passing Java static methods into the RDD transformations map and
 mapValues. The first map turns a simple string K into a pair (K, V), where
 V is a Java ArrayList of large text strings, roughly 50 KB each, read from
 Cassandra. mapValues then processes these text blocks into very small
 ArrayLists.

 The code runs quite slowly compared to running the same work in parallel
 on the same servers from plain Java.

 I gave the Executors the same heap as the plain Java run. Does Java run
 slower under Spark, am I suffering from excess heap pressure, or am I
 missing something?

 Thank you for any insight,
 Oleg




 --
 Kind regards,

 Oleg




-- 
Kind regards,

Oleg


Using Java functions in Spark

2014-06-06 Thread Oleg Proudnikov
Hi All,

I am passing Java static methods into the RDD transformations map and
mapValues. The first map turns a simple string K into a pair (K, V), where
V is a Java ArrayList of large text strings, roughly 50 KB each, read from
Cassandra. mapValues then processes these text blocks into very small
ArrayLists.
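
A minimal sketch of the shape of this pipeline;
MyFunctions.fetchFromCassandra and MyFunctions.process are hypothetical
stand-ins for the actual static methods:

    import java.util.ArrayList;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // keys is a JavaRDD<String> holding the input keys K.
    JavaPairRDD<String, ArrayList<String>> loaded = keys.mapToPair(
        new PairFunction<String, String, ArrayList<String>>() {
          public Tuple2<String, ArrayList<String>> call(String k) {
            // Read the large (~50 KB) text blocks for this key.
            return new Tuple2<String, ArrayList<String>>(
                k, MyFunctions.fetchFromCassandra(k));
          }
        });

    JavaPairRDD<String, ArrayList<String>> results = loaded.mapValues(
        new Function<ArrayList<String>, ArrayList<String>>() {
          public ArrayList<String> call(ArrayList<String> blocks) {
            // Condense each block into a very small ArrayList.
            return MyFunctions.process(blocks);
          }
        });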

The code runs quite slowly compared to running the same work in parallel
on the same servers from plain Java.

I gave the Executors the same heap as the plain Java run. Does Java run
slower under Spark, am I suffering from excess heap pressure, or am I
missing something?
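
A sketch of pinning the executor heap to match the standalone run; the 8g
value is illustrative only:

    import org.apache.spark.SparkConf;

    // Give each Executor the same heap as the plain Java run's -Xmx.
    SparkConf conf = new SparkConf()
        .setAppName("CassandraTextProcessing")  // illustrative name
        .set("spark.executor.memory", "8g");    // e.g. matching -Xmx8g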

Thank you for any insight,
Oleg


Re: Using Java functions in Spark

2014-06-06 Thread Oleg Proudnikov
An additional observation: the map and mapValues are pipelined and
executed, as expected, in pairs. This means there is a simple sequence of
steps for each value of K: first the read from Cassandra, then the
processing. This is exactly the behaviour of a plain Java loop with these
two steps inside. I understand that this avoids a bulk load up front and
the pile-up of massive text arrays.
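
A minimal sketch of that per-key behaviour, reusing the hypothetical
stand-in methods from the sketch above; within a partition, both steps
finish for one key before the next key is read:

    import java.util.ArrayList;
    import java.util.List;

    // Conceptual equivalent of the pipelined map + mapValues stage.
    static List<ArrayList<String>> processPartition(List<String> partitionKeys) {
      List<ArrayList<String>> out = new ArrayList<ArrayList<String>>();
      for (String k : partitionKeys) {
        ArrayList<String> blocks = MyFunctions.fetchFromCassandra(k); // map step
        out.add(MyFunctions.process(blocks));                         // mapValues step
        // blocks becomes collectable here, before the next key is fetched.
      }
      return out;
    }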

Also, the keys are distributed relatively evenly across the Executors.

The question is: why is this still so slow? I would appreciate any
suggestions on where to focus my search.

Thank you,
Oleg



On 6 June 2014 16:24, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

 Hi All,

 I am passing Java static methods into the RDD transformations map and
 mapValues. The first map turns a simple string K into a pair (K, V), where
 V is a Java ArrayList of large text strings, roughly 50 KB each, read from
 Cassandra. mapValues then processes these text blocks into very small
 ArrayLists.

 The code runs quite slowly compared to running the same work in parallel
 on the same servers from plain Java.

 I gave the Executors the same heap as the plain Java run. Does Java run
 slower under Spark, am I suffering from excess heap pressure, or am I
 missing something?

 Thank you for any insight,
 Oleg




-- 
Kind regards,

Oleg