Re: Using Java functions in Spark
Increasing the number of partitions on the data file solved the problem.

On 6 June 2014 18:46, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

Additional observation: the map and mapValues are pipelined and executed, as expected, in pairs. This means that there is a simple sequence of steps: first a read from Cassandra, then the processing, for each value of K. This is exactly how a normal Java loop with these two steps inside would behave. I understand that this avoids loading everything in a batch first and piling up massive text arrays. Also, the keys are relatively evenly distributed across Executors.

The question is: why is this still so slow? I would appreciate any suggestions on where to focus my search.

Thank you,
Oleg

--
Kind regards,
Oleg
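The thread does not say why more partitions helped. One plausible reading, offered here only as an assumption, is that the file of keys was read into very few partitions, so only a few tasks could run at once no matter how many cores the Executors had. The sketch below is plain Java, not Spark code: `split` deals keys round-robin into slices (a simplification; Spark splits a file by byte ranges), and `concurrentTasks` caps the number of simultaneous tasks at one per non-empty slice. The class name, method names, and the core count are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionSketch {

    // Deal keys round-robin into numPartitions slices; each slice stands in for one task.
    static List<List<String>> split(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < keys.size(); i++) {
            partitions.get(i % numPartitions).add(keys.get(i));
        }
        return partitions;
    }

    // Tasks that can run at once: one per non-empty partition, capped by the cores available.
    static int concurrentTasks(List<List<String>> partitions, int totalCores) {
        int nonEmpty = 0;
        for (List<String> p : partitions) {
            if (!p.isEmpty()) {
                nonEmpty++;
            }
        }
        return Math.min(nonEmpty, totalCores);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("key-" + i);
        }
        int cores = 16; // hypothetical total across all Executors
        System.out.println(concurrentTasks(split(keys, 2), cores));  // prints 2
        System.out.println(concurrentTasks(split(keys, 64), cores)); // prints 16
    }
}
```

In Spark's Java API the partition count can be requested when reading the file, via the second argument of `JavaSparkContext.textFile(path, minPartitions)`, or changed afterwards with `repartition(n)`; which of these was used here is not stated.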
Using Java functions in Spark
Hi All,

I am passing Java static methods into the RDD transformations map and mapValues. The first map turns a simple string K into a (K, V) pair, where V is a Java ArrayList of large text strings, 50K each, read from Cassandra. mapValues processes these text blocks into very small ArrayLists.

The code runs quite slowly compared to running it in parallel on the same servers from plain Java. I gave the same heap to the Executors and to Java. Does Java run slower under Spark, do I suffer from excess heap pressure, or am I missing something?

Thank you for any insight,
Oleg
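The actual code is not shown in the thread, so the following is only a minimal plain-Java sketch of the shape described: two static methods passed as functions, one producing V for each K and one shrinking V to a small list. `loadTexts` is a hypothetical stand-in for the Cassandra read and `summarize` a hypothetical stand-in for the text processing; neither name comes from the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class StaticMethodSketch {

    // Hypothetical stand-in for the Cassandra read: the text blocks stored under a key.
    static List<String> loadTexts(String key) {
        return new ArrayList<>(Arrays.asList(key + " alpha beta", key + " gamma"));
    }

    // Hypothetical stand-in for the processing step: one word count per text block.
    static List<Integer> summarize(List<String> texts) {
        List<Integer> counts = new ArrayList<>();
        for (String text : texts) {
            counts.add(text.split("\\s+").length);
        }
        return counts;
    }

    // Applies the two functions per key, the way map followed by mapValues would.
    static <K, V, R> Map<K, R> mapThenMapValues(
            List<K> keys, Function<K, V> load, Function<V, R> process) {
        Map<K, R> out = new LinkedHashMap<>();
        for (K key : keys) {
            V value = load.apply(key);           // "map": K -> (K, V)
            out.put(key, process.apply(value));  // "mapValues": V -> small result
        }
        return out;
    }

    // Passes the static methods as method references.
    static Map<String, List<Integer>> run(List<String> keys) {
        return mapThenMapValues(keys, StaticMethodSketch::loadTexts, StaticMethodSketch::summarize);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("k1", "k2"))); // prints {k1=[3, 2], k2=[3, 2]}
    }
}
```

Under Spark the same two static methods would presumably be handed to the pair-producing map and to `mapValues` on the resulting pair RDD instead of to this local helper.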
Re: Using Java functions in Spark
Additional observation: the map and mapValues are pipelined and executed, as expected, in pairs. This means that there is a simple sequence of steps: first a read from Cassandra, then the processing, for each value of K. This is exactly how a normal Java loop with these two steps inside would behave. I understand that this avoids loading everything in a batch first and piling up massive text arrays. Also, the keys are relatively evenly distributed across Executors.

The question is: why is this still so slow? I would appreciate any suggestions on where to focus my search.

Thank you,
Oleg

On 6 June 2014 16:24, Oleg Proudnikov oleg.proudni...@gmail.com wrote:

Hi All,

I am passing Java static methods into the RDD transformations map and mapValues. The first map turns a simple string K into a (K, V) pair, where V is a Java ArrayList of large text strings, 50K each, read from Cassandra. mapValues processes these text blocks into very small ArrayLists.

The code runs quite slowly compared to running it in parallel on the same servers from plain Java. I gave the same heap to the Executors and to Java. Does Java run slower under Spark, do I suffer from excess heap pressure, or am I missing something?

Thank you for any insight,
Oleg

--
Kind regards,
Oleg
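The "executed in pairs" behaviour described in this message can be reproduced outside Spark. The sketch below is an analogy only: it uses a sequential `java.util.stream` pipeline, which is also lazy and pushes each element through both steps before touching the next one. The event strings and the fake "read" that appends " text" to the key are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class PipeliningSketch {

    // Returns the order in which the two chained steps ran for the given keys.
    static List<String> trace(List<String> keys) {
        List<String> events = new ArrayList<>();
        keys.stream()
            .map(k -> {                       // first step: stand-in for the read
                events.add("read " + k);
                return k + " text";
            })
            .map(v -> {                       // second step: stand-in for the processing
                events.add("process " + v.split(" ")[0]);
                return v.length();
            })
            .collect(Collectors.toList());    // terminal operation triggers the lazy pipeline
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace(Arrays.asList("k1", "k2")));
        // prints [read k1, process k1, read k2, process k2]
    }
}
```

The reads and the processing interleave per key rather than all reads happening first, which matches the observation that no large batch of text arrays builds up between the two steps.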