How to run multiple groupBy's/reduceByKey

2013-11-28 Thread ioannis.deligiannis
Hi, I would like to do multiple groupBy's in an RDD followed by a single reduce. In Java, I would need to override type safety if I want this done multiple times (dynamically). Is there such an example? Would the following work? JavaPairRDD<String, List<Row>> ret = all.groupBy(new
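Assuming the intent is a grouping followed by a per-key reduce, the pattern can be sketched locally with plain Java collections (the Row type and its fields here are made-up stand-ins for the poster's types, not the actual code from the thread):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not from the thread: the same group-then-reduce
// pattern on plain Java collections. Row and its fields are assumed names.
public class GroupThenReduce {
    record Row(String key, int value) {}

    // Local analogue of rdd.groupBy(...) followed by reduceByKey:
    // groupingBy builds the per-key groups, summingInt reduces each group.
    static Map<String, Integer> reduceByKey(List<Row> all) {
        return all.stream().collect(
            Collectors.groupingBy(Row::key, Collectors.summingInt(Row::value)));
    }

    public static void main(String[] args) {
        List<Row> all = Arrays.asList(new Row("a", 1), new Row("a", 2), new Row("b", 3));
        System.out.println(reduceByKey(all)); // {a=3, b=3}
    }
}
```

On an RDD the grouping step would yield the `JavaPairRDD<String, List<Row>>` from the question; the local `Map` plays that role here.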

How to start spark Workers with fixed number of Cores?

2013-11-19 Thread ioannis.deligiannis
I am trying to start 12 workers with 2 cores each on every node using the following: in spark-env.sh (copied to every slave) I have set: SPARK_WORKER_INSTANCES=12 SPARK_WORKER_CORES=2 I start the Scala console with: SPARK_WORKER_CORES=2 SPARK_MEM=3g MASTER=spark://x:7077
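For reference, the quoted settings would live in conf/spark-env.sh on each slave; a minimal sketch using only the values from the message (0.8-era standalone-mode variables):

```shell
# conf/spark-env.sh -- copied to every slave (sketch built from the
# values quoted in the post)
SPARK_WORKER_INSTANCES=12   # 12 worker JVMs per node
SPARK_WORKER_CORES=2        # 2 cores each -> 24 cores per node in total
SPARK_MEM=3g                # note: per-JVM memory multiplies by the instance count
```

One design point worth noting: memory set this way is per worker JVM, so it scales with SPARK_WORKER_INSTANCES.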

How to efficiently manage resources across a cluster and avoid GC overhead exceeded errors?

2013-11-18 Thread ioannis.deligiannis
Hi, I have a cluster of 20 servers, each with 24 cores and 30GB of RAM allocated to Spark. Spark runs in STANDALONE mode. I am trying to load some 200+GB files and cache the rows using .cache(). What I would like to do is the following: (at the moment from the Scala console) -Evenly load the files
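Before tuning for GC, it can help to sanity-check whether the cached data fits in cluster memory at all; a rough sketch of that arithmetic (the 3x object-overhead factor is an assumption for illustration, since rows cached as deserialized Java objects commonly take several times their on-disk size, and real inflation varies by format):

```java
// Back-of-envelope memory check for the cluster described in the post:
// 20 nodes x 30 GB = 600 GB of cluster RAM vs 200+ GB of raw input.
// The overheadFactor is an assumption, not a measured value.
public class CacheBudget {
    static boolean fitsInMemory(double dataGB, int nodes, double ramPerNodeGB,
                                double overheadFactor) {
        return dataGB * overheadFactor <= nodes * ramPerNodeGB;
    }

    public static void main(String[] args) {
        double totalRam = 20 * 30.0; // 600 GB across the cluster
        System.out.println("cluster RAM: " + totalRam + " GB");
        // 200 GB at 3x overhead needs 600 GB -- right at the limit
        System.out.println(fitsInMemory(200, 20, 30, 3.0));
    }
}
```

If the inflated size lands near or above the budget, GC-overhead errors while caching are unsurprising, which makes the even-spread and resource questions in the post the right ones to ask.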