Reducer doesn't operate in parallel

2016-03-03 Thread octavian.ganea
Hi, I've seen in a few cases that calling a reduce operation executes sequentially rather than in parallel. For example, I have the following code that performs simple word counting on very big data using hashmaps (instead of (word, 1) pairs that would overflow the memory at
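
A minimal sketch of the approach being described, assuming the real code reads from a text file (the object name, the whitespace tokenizer, and the output are illustrative, not from the thread): each partition builds a local HashMap of counts, and treeReduce merges the maps pairwise on the executors instead of pulling everything back to the driver for a sequential merge.

import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}

object WordCountHashMaps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count-hashmaps"))
    val counts = sc.textFile(args(0))
      // build one HashMap of word counts per partition instead of emitting (word, 1) pairs
      .mapPartitions { lines =>
        val m = mutable.HashMap.empty[String, Long]
        for (line <- lines; w <- line.split("\\s+") if w.nonEmpty)
          m(w) = m.getOrElse(w, 0L) + 1L
        Iterator(m)
      }
      // treeReduce merges the per-partition maps pairwise on the executors
      .treeReduce { (a, b) =>
        for ((k, v) <- b) a(k) = a.getOrElse(k, 0L) + v
        a
      }
    println(s"distinct words: ${counts.size}")
    sc.stop()
  }
}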

broadcast params to workers at the very beginning

2016-01-09 Thread octavian.ganea
Hi, In my app, I have a Params Scala object that keeps all the specific (hyper)parameters of my program. This object is read in each worker. I would like to be able to pass specific values for the Params fields on the command line. One way would be to simply update all the fields of the Params
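
A minimal sketch of one way to do this, assuming a simple key=value command-line format (the Params fields and the parsing scheme are illustrative, not the poster's actual object): parse the overrides on the driver, build an immutable Params instance, and broadcast it so every worker reads the same values.

import org.apache.spark.{SparkConf, SparkContext}

case class Params(learningRate: Double = 0.1, numIterations: Int = 10)

object ParamsBroadcast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("params-broadcast"))
    // hypothetical CLI format: learningRate=0.05 numIterations=20
    val overrides = args.map(_.split("=", 2)).collect { case Array(k, v) => k -> v }.toMap
    val params = Params(
      learningRate  = overrides.get("learningRate").map(_.toDouble).getOrElse(0.1),
      numIterations = overrides.get("numIterations").map(_.toInt).getOrElse(10))
    val bParams = sc.broadcast(params)  // workers read bParams.value instead of a global object
    val scaled = sc.parallelize(1 to 100).map(_ * bParams.value.learningRate).sum()
    println(scaled)
    sc.stop()
  }
}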

Re: Shuffle strange error

2015-06-05 Thread octavian.ganea
Solved: it was SPARK_PID_DIR from spark-env.sh. Changing this directory from /tmp to something different actually changed the error I got, now showing where the actual error was coming from (a null pointer in my program). The first error was not helpful at all, though.

Shuffle strange error

2015-06-05 Thread octavian.ganea
Hi all, I'm using Spark 1.3.1 and ran the following code: sc.textFile(path).map(line => (getEntId(line), line)).persist(StorageLevel.MEMORY_AND_DISK).groupByKey.flatMap(x => func(x)).reduceByKey((a, b) => (a + b).toShort) I get the following error in

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread octavian.ganea
I tried using reduceByKey, without success. I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey. However, I got the same error as before, namely the error described here:

flatMap output on disk / flatMap memory overhead

2015-06-01 Thread octavian.ganea
Hi, Is there any way to force the output RDD of a flatMap op to be stored in both memory and disk as it is computed? My RAM would not be able to fit the entire output of the flatMap, so it really needs to start using disk after the RAM gets full. I didn't find any way to force this. Also, what
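
A minimal sketch of what seems to be the intended pattern, with illustrative names (the whitespace split stands in for the real expansion): persist the flatMap output with StorageLevel.MEMORY_AND_DISK so partitions that do not fit in RAM spill to local disk as they are computed instead of causing an OOM.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object FlatMapOnDisk {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("flatmap-on-disk"))
    val expanded = sc.textFile(args(0))
      .flatMap(line => line.split("\\s+"))     // stand-in for the real expansion function
      .persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk once memory fills up
    println(expanded.count())                   // the action materializes the RDD under that storage level
    sc.stop()
  }
}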

map - reduce only with disk

2015-06-01 Thread octavian.ganea
Dear all, Does anyone know how I can force Spark to use only the disk when doing a simple flatMap(..).groupByKey.reduce(_ + _)? Thank you!
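
A hedged sketch of one option, not necessarily the answer the thread received: persist the pair RDD with StorageLevel.DISK_ONLY so cached blocks live on disk rather than in executor memory, and use reduceByKey (which combines map-side) instead of groupByKey followed by reduce; the shuffle itself still uses its own buffers and spills on its own.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskOnlyWordSum {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("disk-only"))
    val pairs = sc.textFile(args(0))
      .flatMap(_.split("\\s+").map(w => (w, 1)))
      .persist(StorageLevel.DISK_ONLY)          // keep cached blocks on disk, not in executor memory
    val total = pairs.reduceByKey(_ + _)        // map-side combine avoids groupByKey's big value lists
      .map(_._2)
      .reduce(_ + _)
    println(total)
    sc.stop()
  }
}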

Re: Lost task - connection closed

2015-01-26 Thread octavian.ganea
Here is the first error I get at the executors: 15/01/26 17:27:04 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[handle-message-executor-16,5,main] java.lang.StackOverflowError at

Lost task - connection closed

2015-01-25 Thread octavian.ganea
Hi, I am running a program that executes map-reduce jobs in a loop. The first time the loop runs, everything is OK. After that, it starts giving the following error: first for one task, then for more tasks, and eventually the entire program fails: 15/01/26 01:41:25 WARN

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-21 Thread octavian.ganea
In case someone has the same problem: the singleton hack works for me sometimes and sometimes it doesn't in Spark 1.2.0; that is, sometimes I get a NullPointerException. Anyway, if you really need to work with big indexes and you want the smallest amount of communication between master and
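
A minimal sketch of the singleton pattern being discussed (names are illustrative; BigIndexHolder stands in for whatever non-serializable object is shared): a lazy val inside a Scala object is initialized at most once per JVM, so each executor loads the index locally the first time a task touches it, and nothing large is shipped from the driver with the closures.

import org.apache.spark.{SparkConf, SparkContext}

// One instance per executor JVM: the lazy val is initialized on first use,
// inside the worker, so the (non-serializable) index never travels with the closure.
object BigIndexHolder {
  lazy val index: Map[String, Int] = loadIndex()
  private def loadIndex(): Map[String, Int] = {
    // stand-in for reading a large local index from disk or a service
    Map("spark" -> 1, "scala" -> 2)
  }
}

object SingletonHackExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("singleton-hack"))
    val hits = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map(w => BigIndexHolder.index.getOrElse(w, 0))  // resolved against the per-JVM singleton
      .sum()
    println(hits)
    sc.stop()
  }
}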

Re: Avoid broadcasting huge variables

2015-01-18 Thread octavian.ganea
The singleton hack works very differently in Spark 1.2.0 (it does not work if the program has multiple map-reduce jobs in the same program). I guess there should be official documentation on how to have each machine/node do an init step locally before executing any other instructions (e.g.

Reducer memory exceeded

2015-01-18 Thread octavian.ganea
Hi, Please help me with this problem. I would really appreciate your help! I am using Spark 1.2.0. I have a map-reduce job written in Spark in the following way: val sumW = splittedTrainingDataRDD.map(localTrainingData => LocalSGD(w, localTrainingData, numeratorCtEta, numitorCtEta, regularizer,
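
The preview cuts off before the reduce step, but the usual shape of this job is summing per-partition weight vectors, and that sum is where the reducer memory can blow up. A hedged sketch of an alternative (all names below, including the stand-in for LocalSGD, are illustrative): treeReduce merges the vectors pairwise on the executors, so the driver only ever holds a few partial sums.

import org.apache.spark.{SparkConf, SparkContext}

object SumWeightsTree {
  // stand-in for the real LocalSGD step: produce a local weight vector per data block
  def localStep(block: Array[Double]): Array[Double] = block.map(_ * 0.5)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sum-weights-tree"))
    val dim = 4
    val data = sc.parallelize(Seq.fill(8)(Array.fill(dim)(1.0)), numSlices = 8)
    // treeReduce sums the vectors pairwise on the executors, so the driver
    // receives only a handful of partial sums instead of one result per task.
    val sumW = data.map(localStep).treeReduce { (a, b) =>
      var i = 0
      while (i < a.length) { a(i) += b(i); i += 1 }
      a
    }
    println(sumW.mkString(", "))
    sc.stop()
  }
}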

Re: How to share a NonSerializable variable among tasks in the same worker node?

2015-01-18 Thread octavian.ganea
The singleton hack works very differently in Spark 1.2.0 (it does not work if the program has multiple map-reduce jobs in the same program). I guess there should be official documentation on how to have each machine/node do an init step locally before executing any other instructions (e.g.

Re: How to use memcached with spark

2015-01-12 Thread octavian.ganea
I am trying to use it, but without success. Any sample code that works with Spark would be highly appreciated. :)
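
A hedged sketch of the pattern that usually works here, assuming the spymemcached client (net.spy.memcached) and a placeholder server address: the client is not serializable, so it is created lazily once per executor JVM and reused across partitions via mapPartitions.

import java.net.InetSocketAddress
import net.spy.memcached.MemcachedClient
import org.apache.spark.{SparkConf, SparkContext}

// One client per executor JVM, created on first use inside the worker.
object MemcachedHolder {
  lazy val client: MemcachedClient =
    new MemcachedClient(new InetSocketAddress("memcached.host.placeholder", 11211))
}

object MemcachedLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("memcached-lookup"))
    val found = sc.textFile(args(0))
      .mapPartitions { keys =>
        val c = MemcachedHolder.client            // reused for the whole partition
        keys.map(k => (k, Option(c.get(k)).isDefined))
      }
      .filter(_._2)
      .count()
    println(s"keys found in memcached: $found")
    sc.stop()
  }
}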

Re: Bug in Accumulators...

2014-11-22 Thread octavian.ganea
One month later, the same problem. I think that someone (e.g. the inventors of Spark) should show us a big example of how to use accumulators. I can start by saying that we need to see an example of the following form: val accum = sc.accumulator(0) sc.parallelize(Array(1, 2, 3, 4)).map(x =>
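
Since the preview cuts off, here is a hedged, self-contained example of the pre-2.0 accumulator API the thread is using (sc.accumulator): increment inside foreach, an action, because transformations like map are lazy and may be re-executed, and read the value only on the driver after the action has finished.

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("accumulator-example"))
    val accum = sc.accumulator(0)                 // pre-2.0 API, as in the thread
    sc.parallelize(Array(1, 2, 3, 4)).foreach { x =>
      accum += x                                  // executors only add; they cannot read the value
    }
    println(s"accum = ${accum.value}")            // the driver reads it after the action completes
    sc.stop()
  }
}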

Re: Bug in Accumulators...

2014-11-22 Thread octavian.ganea
Hi Sowen, You're right, that example works, but look at an example that does not work for me: object Main { def main(args: Array[String]) { val conf = new SparkConf().setAppName(name) val sc = new SparkContext(conf) val accum = sc.accumulator(0) for (i <- 1 to 10) { val

Re: Bug in Accumulators...

2014-10-27 Thread octavian.ganea
I am also using Spark 1.1.0 and I ran it on a cluster of nodes (it works if I run it in local mode!). If I put the accumulator inside the for loop, everything works fine. I guess the bug is that an accumulator can be applied to JUST one RDD. Still another undocumented 'feature' of Spark

Re: Bug in Accumulators...

2014-10-26 Thread octavian.ganea
Sorry, I forgot to say that this gives the above error just when run on a cluster, not in local mode.

Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-26 Thread octavian.ganea
Hi Akhil, Please see this related message. http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-td17263.html I am curious if this works for you also.

NullPointerException when using Accumulators on cluster

2014-10-25 Thread octavian.ganea
Hi, I have a simple accumulator that needs to be passed to a foo() function inside a map job: val myCounter = sc.accumulator(0) val myRDD = sc.textFile(inputpath) // :spark.RDD[String] myRDD.flatMap(line => foo(line)) def foo(line: String) = { myCounter += 1 // line throwing

Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-25 Thread octavian.ganea
Hi all, I tried to use accumulators without any success so far. My code is simple: val sc = new SparkContext(conf) val accum = sc.accumulator(0) val partialStats = sc.textFile(f.getAbsolutePath()).map(line => { val key = line.split("\t").head; (key, line) })

Bug in Accumulators...

2014-10-25 Thread octavian.ganea
There is for sure a bug in the Accumulators code. More specifically, the following code works as expected: def main(args: Array[String]) { val conf = new SparkConf().setAppName("EL LBP SPARK") val sc = new SparkContext(conf) val accum = sc.accumulator(0)

Re: Avoid broadcasting huge variables

2014-09-21 Thread octavian.ganea
Using mapPartitions and passing the big index object as a parameter to it was not the best option, given the size of the big object and my RAM. The workers died before starting the actual computation. Anyway, creating a singleton object worked for me: