Spark hangs when I call parallelize + count on an ArrayList<byte[]> with 40k elements

2014-04-23 Thread amit karmakar
Spark hangs after I perform the following operations:


ArrayList<byte[]> bytesList = new ArrayList<byte[]>();
/*
   add 40k entries to bytesList
*/

JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
System.out.println("Count=" + rdd.count());


If I add just one entry, it works.

It works if I change
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
to
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList, 20);

There is nothing in the logs that can help understand the reason.

What could be the reason for this?


Regards,
Amit Kumar Karmakar


Re: Spark hangs when I call parallelize + count on an ArrayList<byte[]> with 40k elements

2014-04-23 Thread Xiangrui Meng
How big is each entry, and how much memory do you have on each
executor? You generated all the data on the driver, and
sc.parallelize(bytesList) will send the entire dataset to a single
executor. You may run into I/O or memory issues. If the entries are
generated, you should create a simple RDD with sc.parallelize(0 until 20,
20) and call mapPartitions to generate them in parallel. -Xiangrui
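The suggestion above can be sketched in Java as follows. This is a minimal illustration, not Xiangrui's exact code: the class name, the per-partition entry count, and the 1 KB placeholder entries are all assumptions, and it targets Spark 2.x+, where a `mapPartitions` lambda returns an `Iterator`. Each of the 20 tasks generates its own slice of the 40k entries, so nothing large is ever shipped from the driver.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GenerateOnExecutors {

    // Build a 40k-element RDD of byte[] without materializing it on the
    // driver: each of the 20 tasks generates its own 2,000 entries.
    static JavaRDD<byte[]> buildRdd(JavaSparkContext sc) {
        int numPartitions = 20;
        int entriesPerPartition = 2000; // 20 * 2000 = 40k entries total

        // Seed RDD: one small integer per partition, so only 20 ints
        // travel from the driver to the executors.
        List<Integer> seeds = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            seeds.add(i);
        }

        return sc.parallelize(seeds, numPartitions)
                .mapPartitions(it -> {
                    List<byte[]> out = new ArrayList<>();
                    while (it.hasNext()) {
                        it.next(); // consume the seed for this partition
                        for (int j = 0; j < entriesPerPartition; j++) {
                            // placeholder: generate the real entry here
                            out.add(new byte[1024]);
                        }
                    }
                    return out.iterator();
                });
    }

    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext("local[*]", "generate-on-executors");
        System.out.println("Count=" + buildRdd(sc).count());
        sc.stop();
    }
}
```

The same shape works if the entries come from external storage: replace the generation loop with reads keyed off the seed value, so the I/O also happens on the executors.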
