What is a taskBinary for a ShuffleMapTask? What is its purpose?
Hi,

What is the purpose of the taskBinary for a ShuffleMapTask? What does it contain, and how is it useful? Is it a representation of all the RDD operations that will be applied to the partition the task will be processing? (In the debugger dump below, the task will process stage 0, partition 0.) If it is not a representation of the RDD operations inside the stage, then how does a task know which operations it should apply to its partition?

Thanks,

{ShuffleMapTask@9034} "ShuffleMapTask(0, 0)"
  taskBinary = {TorrentBroadcast@8204} "Broadcast(1)"
    org$apache$spark$broadcast$TorrentBroadcast$$evidence$1 = {ClassTag$$anon$1@8470} "Array[byte]"
    org$apache$spark$broadcast$TorrentBroadcast$$broadcastId = {BroadcastBlockId@8249} "broadcast_1"
    numBlocks = 1
    _value = null
    org$apache$spark$broadcast$TorrentBroadcast$$compressionCodec = {Some@8468} "Some(org.apache.spark.io.SnappyCompressionCodec@7ede98e1)"
    blockSize = 4194304
    bitmap$trans$0 = false
    id = 1
    org$apache$spark$broadcast$Broadcast$$_destroySite = {String@5327} ""
    _isValid = true
    org$apache$spark$Logging$$log_ = null
  partition = {HadoopPartition@9049}
  locs = {$colon$colon@9050} "::"  size = 1
  preferredLocs = {ArrayBuffer@9051} "ArrayBuffer"  size = 1
  org$apache$spark$Logging$$log_ = null
  stageId = 0
  partitionId = 0
  taskMemoryManager = null
  epoch = -1
  metrics = {None$@5261} "None"
  _executorDeserializeTime = 0
  context = null
  taskThread = null
  _killed = false
  etc., etc.
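For context: in the Spark 1.x source, ShuffleMapTask.runTask deserializes taskBinary into the stage's (RDD, ShuffleDependency) pair, so it does capture the chain of RDD operations inside the stage. A rough sketch of that step (paraphrased from the source, illustrative rather than exact):

    // Paraphrased from ShuffleMapTask.runTask (Spark 1.x); illustrative only.
    import java.nio.ByteBuffer
    import org.apache.spark.{ShuffleDependency, SparkEnv}
    import org.apache.spark.rdd.RDD

    // taskBinary.value is the broadcast Array[Byte] seen in the dump above.
    val ser = SparkEnv.get.closureSerializer.newInstance()
    val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
      ByteBuffer.wrap(taskBinary.value),
      Thread.currentThread.getContextClassLoader)
    // rdd.iterator(partition, context) then applies every operation in the
    // stage's lineage to this task's partition before the shuffle write.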
Difference between sparkDriver and "executor ID driver"
I'm running Spark in local mode and getting these two log messages that appear to be similar. I want to understand what each is doing:

1. [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on port 60782.
2. [main] executor.Executor (Logging.scala:logInfo(59)) - Starting executor ID driver on host localhost

The first is created using:

    val actorSystemName = if (isDriver) driverActorSystemName else executorActorSystemName
    val rpcEnv = RpcEnv.create(actorSystemName, hostname, port, conf, securityManager)
    val actorSystem = rpcEnv.asInstanceOf[AkkaRpcEnv].actorSystem

The second is created when:

    _taskScheduler.start()

What is the difference, and what does each do?
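For context: in local mode the driver JVM also hosts the single executor, and that executor's ID is literally the string "driver". The 'sparkDriver' service in message 1 is the driver's RpcEnv/actor system; message 2 is logged when _taskScheduler.start() spins up the in-process executor. A rough sketch of that second step (paraphrasing the Spark 1.x LocalBackend; the constructor signature is approximate, not the exact source):

    // Paraphrased sketch of Spark 1.x local mode; signature approximate.
    // The in-process Executor is what actually runs tasks.
    val localExecutorId = "driver"  // SparkContext.DRIVER_IDENTIFIER
    val executor = new Executor(
      localExecutorId, "localhost", SparkEnv.get, userClassPath = Nil, isLocal = true)
    // logs: "Starting executor ID driver on host localhost"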
Error:(46, 66) not found: type SparkFlumeProtocol
I'm trying to build Spark using IntelliJ on Windows, but I'm repeatedly getting this error in spark-master\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala:

    Error:(46, 66) not found: type SparkFlumeProtocol
      val transactionTimeout: Int, val backOffInterval: Int) extends SparkFlumeProtocol with Logging {
                                                                      ^
    Error:(72, 39) not found: type EventBatch
      override def getEventBatch(n: Int): EventBatch = {
                                          ^
    Error:(87, 13) not found: type EventBatch
        new EventBatch("Spark sink has been stopped!", "", java.util.Collections.emptyList())
            ^

I had the same error on Linux, but there I solved it by right-clicking on flume-sink - Maven - Generate Sources and Update Folders. On Windows, that doesn't seem to work. Any ideas?

Thanks,
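Note that SparkFlumeProtocol and EventBatch are not checked-in Scala classes; they are generated from the Avro IDL in the flume-sink module by the avro-maven-plugin, so the IDE cannot resolve them until source generation has run. Assuming the Maven build itself works on your machine, one way to force generation from the command line (run from the Spark source root) is:

    mvn -pl external/flume-sink generate-sources

After that, marking the module's target/generated-sources output as a generated-sources root in IntelliJ (which is what "Generate Sources and Update Folders" automates) should let the types resolve.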
Newbie question: what makes Spark run faster than MapReduce
Consider the classic word-count application on a 4-node cluster with a sizable working set. What makes Spark run faster than MapReduce, considering that Spark also has to write to disk during the shuffle?
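For concreteness, the word-count pipeline in question might look like this in Spark (the HDFS paths are placeholders):

    // Classic word count in Spark; paths are placeholders. Note that the
    // shuffle for reduceByKey still writes map-side output to local disk,
    // much as MapReduce's map output does.
    val counts = sc.textFile("hdfs:///input/words")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///output/counts")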
Spark is in-memory processing, how then can Tachyon make Spark faster?
Spark is an in-memory engine and attempts to do its computation in memory. Tachyon is memory-centric distributed storage, OK, but how would that help run Spark faster?
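For reference, one concrete touch point in the Spark 1.x timeframe of this thread is OFF_HEAP caching, which stores RDD blocks in Tachyon instead of on the executor heaps. A minimal sketch (the Tachyon master URL is made up):

    // Sketch: Tachyon-backed caching in Spark 1.x. OFF_HEAP blocks live
    // outside the executor JVM heap (less GC pressure), survive executor
    // crashes, and can be shared across jobs reading the same data.
    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("tachyon://tachyon-master:19998/input")  // hypothetical URL
    data.persist(StorageLevel.OFF_HEAP)
    data.count()  // materializes the cache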
Re: Newbie question: can shuffle avoid writing and reading from disk?
Thanks!

On Wed, Aug 5, 2015 at 5:24 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

Yes, in the end the shuffle data will be written to disk for the reduce stage to pull, no matter how large you set the shuffle memory fraction.

Thanks
Saisai

On Thu, Aug 6, 2015 at 7:50 AM, Muler mulugeta.abe...@gmail.com wrote:

Thanks, so if I have large enough memory (with enough spark.shuffle.memoryFraction), then shuffle spill doesn't happen (per node), but the shuffle data still has to be written to disk so that the reduce stage can pull it across the network?

On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

Hi Muler,

Shuffle data will be written to disk no matter how much memory you have; large memory can alleviate shuffle spill, where temporary files are generated when memory is not enough. Yes, each node writes its shuffle data to file, and in the reduce stage it is pulled from disk by the network framework (Netty by default).

Thanks
Saisai

On Thu, Aug 6, 2015 at 7:10 AM, Muler mulugeta.abe...@gmail.com wrote:

Hi,

Consider I'm running WordCount with 100m of data on a 4-node cluster. Assuming my RAM size on each node is 200g and I'm giving my executors 100g (just enough memory for 100m of data):

1. If I have enough memory, can Spark 100% avoid writing to disk?
2. During the shuffle, where results have to be collected from nodes, does each node write to disk and then the results are pulled from disk? If not, what is the API that is being used to pull data from nodes across the cluster? (I'm thinking: what Scala or Java packages would allow you to read in-memory data from other machines?)

Thanks,
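For anyone tuning this, the knobs discussed above are, in Spark 1.x terms, roughly the following (example values, not recommendations):

    // Spark 1.x shuffle settings touched on in this thread.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.memoryFraction", "0.4")  // heap fraction for shuffle aggregation buffers
      .set("spark.shuffle.spill", "true")          // spill to temp files when that fraction is exceeded
      .set("spark.shuffle.blockTransferService", "netty")  // transport reducers use to fetch map output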
Newbie question: can shuffle avoid writing and reading from disk?
Hi,

Consider I'm running WordCount with 100m of data on a 4-node cluster. Assuming my RAM size on each node is 200g and I'm giving my executors 100g (just enough memory for 100m of data):

1. If I have enough memory, can Spark 100% avoid writing to disk?
2. During the shuffle, where results have to be collected from nodes, does each node write to disk and then the results are pulled from disk? If not, what is the API that is being used to pull data from nodes across the cluster? (I'm thinking: what Scala or Java packages would allow you to read in-memory data from other machines?)

Thanks,
Re: Newbie question: can shuffle avoid writing and reading from disk?
Thanks, so if I have large enough memory (with enough spark.shuffle.memoryFraction), then shuffle spill doesn't happen (per node), but the shuffle data still has to be written to disk so that the reduce stage can pull it across the network?

On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao sai.sai.s...@gmail.com wrote:

Hi Muler,

Shuffle data will be written to disk no matter how much memory you have; large memory can alleviate shuffle spill, where temporary files are generated when memory is not enough. Yes, each node writes its shuffle data to file, and in the reduce stage it is pulled from disk by the network framework (Netty by default).

Thanks
Saisai

On Thu, Aug 6, 2015 at 7:10 AM, Muler mulugeta.abe...@gmail.com wrote:

Hi,

Consider I'm running WordCount with 100m of data on a 4-node cluster. Assuming my RAM size on each node is 200g and I'm giving my executors 100g (just enough memory for 100m of data):

1. If I have enough memory, can Spark 100% avoid writing to disk?
2. During the shuffle, where results have to be collected from nodes, does each node write to disk and then the results are pulled from disk? If not, what is the API that is being used to pull data from nodes across the cluster? (I'm thinking: what Scala or Java packages would allow you to read in-memory data from other machines?)

Thanks,