What is a taskBinary for a ShuffleMapTask? What is its purpose?

2015-09-21 Thread Muler
Hi,

What is the purpose of the taskBinary for a ShuffleMapTask? What does it
contain and how is it useful? Is it the representation of all the RDD
operations that will be applied to the partition the task will be
processing? (In the case below, the task will process stage 0, partition 0.)
If it is not a representation of the RDD operations inside the stage, then
how does a task know which operations it should apply to its partition?

Thanks,

{ShuffleMapTask@9034} "ShuffleMapTask(0, 0)"
  taskBinary = {TorrentBroadcast@8204} "Broadcast(1)"
    org$apache$spark$broadcast$TorrentBroadcast$$evidence$1 = {ClassTag$$anon$1@8470} "Array[byte]"
    org$apache$spark$broadcast$TorrentBroadcast$$broadcastId = {BroadcastBlockId@8249} "broadcast_1"
    numBlocks = 1
    _value = null
    org$apache$spark$broadcast$TorrentBroadcast$$compressionCodec = {Some@8468} "Some(org.apache.spark.io.SnappyCompressionCodec@7ede98e1)"
    blockSize = 4194304
    bitmap$trans$0 = false
    id = 1
    org$apache$spark$broadcast$Broadcast$$_destroySite = {String@5327} ""
    _isValid = true
    org$apache$spark$Logging$$log_ = null
  partition = {HadoopPartition@9049}
  locs = {$colon$colon@9050} "::" size = 1
  preferredLocs = {ArrayBuffer@9051} "ArrayBuffer" size = 1
  org$apache$spark$Logging$$log_ = null
  stageId = 0
  partitionId = 0
  taskMemoryManager = null
  epoch = -1
  metrics = {None$@5261} "None"
  _executorDeserializeTime = 0
  context = null
  taskThread = null
  _killed = false

etc..
etc..
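
For reference, a simplified paraphrase of what the executor does with that
broadcast in the 1.x sources (a sketch, not the exact code): taskBinary is a
Broadcast[Array[Byte]] holding the serialized (RDD, ShuffleDependency) pair for
the stage, and each task deserializes it to learn which operations to run over
its own partition.

import java.nio.ByteBuffer

import org.apache.spark.{Partition, ShuffleDependency, SparkEnv, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Paraphrase of the executor-side logic of ShuffleMapTask.runTask (Spark 1.x,
// simplified); not meant as user-facing API.
object TaskBinarySketch {
  def runTaskSketch(taskBinary: Broadcast[Array[Byte]],
                    partition: Partition,
                    context: TaskContext): Unit = {
    // The broadcast bytes are the serialized (RDD, ShuffleDependency) for the stage,
    // shipped once per executor instead of once per task.
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value))
    // The deserialized RDD object carries the whole lineage inside the stage, so this
    // is how the task "knows" what to apply: it iterates over its own partition and the
    // pipelined compute() chain does the work; the real task feeds this iterator to the
    // shuffle writer obtained for dep.shuffleHandle.
    val records = rdd.iterator(partition, context)
    records.foreach(_ => ()) // stand-in for writing the map output
  }
}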


Difference between sparkDriver and "executor ID driver"

2015-09-15 Thread Muler
I'm running Spark in local mode and getting these two log messages that
appear to be similar. I want to understand what each one does:


   1. [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started
   service 'sparkDriver' on port 60782.
   2. [main] executor.Executor (Logging.scala:logInfo(59)) - Starting
   executor ID driver on host localhost

1. is created using:

val actorSystemName =
  if (isDriver) driverActorSystemName else executorActorSystemName

val rpcEnv = RpcEnv.create(actorSystemName, hostname, port, conf, securityManager)
val actorSystem = rpcEnv.asInstanceOf[AkkaRpcEnv].actorSystem

2. is created when:

 _taskScheduler.start()


What is the difference and what does each do?
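
For what it's worth, here is a minimal local-mode program (hypothetical object
and app name) that produces both messages in the same JVM; the comments
summarize what each one corresponds to:

import org.apache.spark.{SparkConf, SparkContext}

// In local mode both log lines come from one JVM: "sparkDriver" is the driver-side
// RpcEnv / actor system binding its port (message 1), while "executor ID driver" is
// the single in-process executor that the local scheduler backend starts when
// _taskScheduler.start() is called (message 2).
object LocalModeDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("local-mode-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 100).map(_ * 2).count() // run something so the executor gets tasks
    sc.stop()
  }
}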


Error:(46, 66) not found: type SparkFlumeProtocol

2015-08-25 Thread Muler
I'm trying to build Spark using Intellij on Windows. But I'm repeatedly
getting this error

spark-master\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala

Error:(46, 66) not found: type SparkFlumeProtocol
  val transactionTimeout: Int, val backOffInterval: Int) extends SparkFlumeProtocol with Logging {
                                                                  ^
Error:(72, 39) not found: type EventBatch
  override def getEventBatch(n: Int): EventBatch = {
                                      ^
Error:(87, 13) not found: type EventBatch
        new EventBatch("Spark sink has been stopped!", "", java.util.Collections.emptyList())
            ^

I had the same error on Linux, but there I solved it by right-clicking on the
flume-sink module - Maven - Generate Sources and Update Folders. On Windows,
that doesn't seem to work. Any ideas?

Thanks,


Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Muler
Consider the classic word count application over a 4-node cluster with a
sizable working data set. What makes Spark run faster than MapReduce,
considering that Spark also has to write to disk during the shuffle?
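
For concreteness, the word count in question written as a Spark job (a sketch;
the object name and paths are placeholders). The comments mark what is
pipelined in memory within a stage and where the single shuffle boundary
touches disk:

import org.apache.spark.{SparkConf, SparkContext}

// Classic word count. flatMap and map are pipelined in memory within one stage
// (no intermediate materialization between them), and only the reduceByKey shuffle
// boundary writes map output to local disk. Executor JVMs are also long-lived and
// reused across tasks, unlike classic MapReduce, which launches a JVM per task.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
    val counts = sc.textFile("hdfs:///data/input")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///data/output")     // placeholder path
    sc.stop()
  }
}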


Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Muler
Spark is an in-memory engine and attempts to do computation in memory.
Tachyon is memory-centric distributed storage, OK, but how would that help
run Spark faster?
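
If it helps frame the question: in the 1.x line an RDD can be persisted with
StorageLevel.OFF_HEAP, which stores its blocks in Tachyon rather than in the
executor heaps. A hedged sketch (the external block store URL is a placeholder
and the property name is the 1.4+-era one):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: OFF_HEAP persistence keeps cached blocks in Tachyon, outside the executor
// JVMs, so the cached data survives executor crashes and GC pressure and can be
// shared by several Spark jobs without each re-caching its own copy.
object TachyonPersistDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tachyon-demo")
      .set("spark.externalBlockStore.url", "tachyon://localhost:19998") // placeholder URL
    val sc = new SparkContext(conf)
    val cached = sc.textFile("hdfs:///data/input").persist(StorageLevel.OFF_HEAP) // placeholder path
    println(cached.count())
    sc.stop()
  }
}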


Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Thanks!

On Wed, Aug 5, 2015 at 5:24 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

 Yes, ultimately shuffle data will be written to disk for the reduce stage to
 pull, no matter how large you set the shuffle memory fraction.

 Thanks
 Saisai

 On Thu, Aug 6, 2015 at 7:50 AM, Muler mulugeta.abe...@gmail.com wrote:

 thanks, so if I have large enough memory (with enough
 spark.shuffle.memory) then shuffle spill (in-memory shuffle) doesn't happen
 (per node), but the shuffle data still has to be ultimately written to disk so
 that the reduce stage pulls it across the network?

 On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Hi Muler,

 Shuffle data will be written to disk no matter how much memory you
 have; large memory can alleviate shuffle spill, where temporary files are
 generated if memory is not enough.

 Yes, each node writes shuffle data to file, and it is pulled from disk in the
 reduce stage through the network framework (the default is Netty).

 Thanks
 Saisai

 On Thu, Aug 6, 2015 at 7:10 AM, Muler mulugeta.abe...@gmail.com wrote:

 Hi,

 Consider I'm running WordCount with 100m of data on a 4-node cluster.
 Assuming my RAM size on each node is 200g and I'm giving my executors 100g
 (just enough memory for the 100m of data):

 1. If I have enough memory, can Spark 100% avoid writing to disk?
 2. During shuffle, where results have to be collected from nodes, does each
    node write to disk and then the results are pulled from disk? If not, what
    is the API that is being used to pull data from nodes across the cluster?
    (I'm thinking what Scala or Java packages would allow you to read
    in-memory data from other machines?)

 Thanks,
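
To make the knobs discussed in this thread concrete, a hedged Spark 1.x
configuration sketch (property names as listed in the 1.x configuration docs;
the values are only illustrative):

import org.apache.spark.SparkConf

// A larger shuffle memory fraction means fewer spills of the in-memory maps while
// writing shuffle output, but the final per-map shuffle files are always written to
// local disk, and reducers fetch them over the network (Netty by default).
object ShuffleConfigSketch {
  val conf = new SparkConf()
    .setAppName("shuffle-config-demo")
    .set("spark.shuffle.memoryFraction", "0.4")    // default 0.2 in Spark 1.x
    .set("spark.shuffle.compress", "true")         // compress map output files
    .set("spark.shuffle.spill.compress", "true")   // compress data spilled during shuffles
    .set("spark.shuffle.blockTransferService", "netty") // the 1.x default
}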







Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
Hi,

Consider I'm running WordCount with 100m of data on a 4-node cluster.
Assuming my RAM size on each node is 200g and I'm giving my executors 100g
(just enough memory for the 100m of data):


   1. If I have enough memory, can Spark 100% avoid writing to disk?
   2. During shuffle, where results have to be collected from nodes, does
   each node write to disk and then the results are pulled from disk? If not,
   what is the API that is being used to pull data from nodes across the
   cluster? (I'm thinking what Scala or Java packages would allow you to read
   in-memory data from other machines?)

Thanks,


Re: Newbie question: can shuffle avoid writing and reading from disk?

2015-08-05 Thread Muler
thanks, so if I have large enough memory (with enough spark.shuffle.memory)
then shuffle spill (in-memory shuffle) doesn't happen (per node), but the
shuffle data still has to be ultimately written to disk so that the reduce
stage pulls it across the network?

On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

 Hi Muler,

 Shuffle data will be written to disk no matter how much memory you have;
 large memory can alleviate shuffle spill, where temporary files are
 generated if memory is not enough.

 Yes, each node writes shuffle data to file, and it is pulled from disk in the
 reduce stage through the network framework (the default is Netty).

 Thanks
 Saisai

 On Thu, Aug 6, 2015 at 7:10 AM, Muler mulugeta.abe...@gmail.com wrote:

 Hi,

 Consider I'm running WordCount with 100m of data on a 4-node cluster.
 Assuming my RAM size on each node is 200g and I'm giving my executors 100g
 (just enough memory for the 100m of data):

 1. If I have enough memory, can Spark 100% avoid writing to disk?
 2. During shuffle, where results have to be collected from nodes, does each
    node write to disk and then the results are pulled from disk? If not, what
    is the API that is being used to pull data from nodes across the cluster?
    (I'm thinking what Scala or Java packages would allow you to read
    in-memory data from other machines?)

 Thanks,