What is a taskBinary for a ShuffleMapTask? What is its purpose?

2015-09-21 Thread Muler
Hi, What is the purpose of the taskBinary for a ShuffleMapTask? What does it contain, and how is it useful? Is it the representation of all the RDD operations that will be applied to the partition the task will be processing? (in the case below the task will process stage 0, partition 0) If it
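Roughly speaking, the taskBinary is the serialized, broadcast description of the work that every task in a stage shares; for a ShuffleMapTask that is the stage's RDD lineage together with its shuffle dependency, which each task deserializes and applies to its own partition. A toy Python sketch of that flow (not Spark's actual code; note that stdlib pickle serializes a top-level function by reference, unlike Spark's closure serializer, which ships the actual closure bytes):

```python
import pickle

# Hypothetical sketch: the "task binary" is built once on the driver,
# shipped to executors, and every task in the stage deserializes the
# same bytes, then processes only its own partition of the data.

def times_ten(x):
    return x * 10

def build_task_binary(func):
    # Driver side: serialize the shared computation once for the whole stage.
    return pickle.dumps(func)

def run_task(task_binary, partition_data):
    # Executor side: each task deserializes the same binary and applies it
    # to its own partition only.
    func = pickle.loads(task_binary)
    return [func(x) for x in partition_data]

partitions = [[1, 2], [3, 4], [5, 6]]
binary = build_task_binary(times_ten)
results = [run_task(binary, p) for p in partitions]
print(results)  # [[10, 20], [30, 40], [50, 60]]
```

The point of serializing once and broadcasting is that the (possibly large) lineage description is not re-sent with every individual task.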

Difference between sparkDriver and "executor ID driver"

2015-09-15 Thread Muler
I'm running Spark in local mode and getting these two log messages that appear to be similar. I want to understand what each is doing: 1. [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on port 60782. 2. [main] executor.Executor

Error:(46, 66) not found: type SparkFlumeProtocol

2015-08-25 Thread Muler
I'm trying to build Spark using Intellij on Windows. But I'm repeatedly getting this error spark-master\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala Error:(46, 66) not found: type SparkFlumeProtocol val transactionTimeout: Int, val
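`SparkFlumeProtocol` is not hand-written Scala; it is generated from the Avro IDL in the flume-sink module at build time, so an IDE reports "not found" until those generated sources exist. One commonly suggested workaround (assuming a Maven build, with the module path taken from the error message) is to run the standard Maven `generate-sources` phase for that module and then point the IDE at the output:

```shell
# SparkFlumeProtocol is generated from Avro IDL during the build, so the
# editor cannot resolve it until generated sources exist. Run the standard
# Maven generate-sources lifecycle phase for the flume-sink module:
mvn -pl external/flume-sink generate-sources
# Then mark the generated output directory (under the module's target/)
# as a source root in IntelliJ so the type resolves in the editor.
```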

Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Muler
Consider the classic word count application over a 4 node cluster with a sizable working data set. What makes Spark run faster than MapReduce, considering that Spark also has to write to disk during shuffle?

Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Muler
Spark is an in-memory engine and attempts to do computation in memory. Tachyon is a memory-centric distributed storage system, OK, but how would that help run Spark faster?

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Thanks! On Wed, Aug 5, 2015 at 5:24 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Yes, finally shuffle data will be written to disk for the reduce stage to pull, no matter how large you set the shuffle memory fraction. Thanks Saisai On Thu, Aug 6, 2015 at 7:50 AM, Muler mulugeta.abe

Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Hi, Consider I'm running WordCount with 100m of data on a 4 node cluster. Assuming my RAM size on each node is 200g and I'm giving my executors 100g (just enough memory for 100m of data): 1. If I have enough memory, can Spark 100% avoid writing to disk? 2. During shuffle, where results have to

Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
sai.sai.s...@gmail.com wrote: Hi Muler, Shuffle data will be written to disk no matter how much memory you have; large memory can alleviate shuffle spill, where temporary files are generated when memory is not enough. Yes, each node writes shuffle data to file and it is pulled from disk in the reduce
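The mechanism the reply describes can be sketched in a few lines of Python (an assumed toy file layout, not Spark's actual shuffle format): each map task hash-partitions its output by key and writes one file per reduce partition; each reduce task then pulls "its" file from every map task. The files exist regardless of available memory, which lets reducers fetch after map tasks finish and supports re-running failed tasks.

```python
import os
import tempfile
from collections import defaultdict

NUM_REDUCERS = 2

def map_side(task_id, words, shuffle_dir):
    # Map task: hash-partition output by key, one file per reduce partition.
    buckets = defaultdict(list)
    for word in words:
        buckets[hash(word) % NUM_REDUCERS].append(word)
    for r, bucket in buckets.items():
        with open(os.path.join(shuffle_dir, f"map{task_id}_r{r}"), "w") as f:
            f.write("\n".join(bucket))

def reduce_side(r, num_maps, shuffle_dir):
    # Reduce task r: pull its partition's file from every map task's output,
    # then aggregate (here: word count).
    counts = defaultdict(int)
    for m in range(num_maps):
        path = os.path.join(shuffle_dir, f"map{m}_r{r}")
        if os.path.exists(path):  # a map task may have had no keys for r
            with open(path) as f:
                for word in f.read().splitlines():
                    counts[word] += 1
    return dict(counts)

with tempfile.TemporaryDirectory() as d:
    map_side(0, ["a", "b", "a"], d)
    map_side(1, ["b", "b"], d)
    merged = {}
    for r in range(NUM_REDUCERS):
        merged.update(reduce_side(r, 2, d))
print(merged)  # counts: a -> 2, b -> 3
```

Spill, as the reply notes, is a separate matter: extra temporary files created mid-aggregation when in-memory buffers fill up, on top of the shuffle files themselves.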