Thanks TD. One more question:
We are building real time analytics using spark streaming - We read from kafka, process it (flatMap -> Map -> reduceByKeyAndWindow -> filter) and then save to Cassandra (using DStream.foreachRDD). Initially I used a machine with 32 cores, 32 GB and performed load testing. with 1 master and 1 worker. in the same box. Later I added one more box and launched worker on that box (32 core 16GB). I set spark.executor.memory=10G in driver program I expected the performance should increase linearly as mentioned in spark streaming video but it did not help. Can you please explain why it is so? Also how can we increase? Thanks, Sourav On Fri, Feb 14, 2014 at 7:50 AM, Tathagata Das <tathagata.das1...@gmail.com>wrote: > Answers inline. Hope these answer your questions. > > TD > > > On Thu, Feb 13, 2014 at 5:49 PM, Sourav Chandra < > sourav.chan...@livestream.com> wrote: > >> HI, >> >> I have couple of questions: >> >> 1. While going through the spark-streaming code, I found out there is one >> configuration in JobScheduler/Generator (spark.streaming.concurrentJobs) >> which is set to 1. There is no documentation for this parameter. After >> setting this to 1000 in driver program, our streaming application's >> performance is improved. >> > > That is a parameter that allows Spark Stremaing to launch multiple Spark > jobs simultaneously. While it can improve the performance in many scenarios > (as it has in your case), it can actually increase the processing time of > each batch and increase end-to-end latency in certain scenarios. So it is > something that needs to be used with caution. That said, we should have > definitely exposed it in the documentation. > > >> What is this variable used for? Is it safe to use/tweak this parameter? >> >> 2. Can someone explain the usage of MapOutputTracker, BlockManager >> component. I have gone through the youtube video of Matei about spark >> internals but this was not covered in detail. >> > > I am not sure if there is a detailed document anywhere that explains but I > can give you a high level overview of the both. > > BlockManager is like a distributed key-value store for large blobs (called > blocks) of data. It has a master-worker architecture (loosely it is like > the HDFS file system) where the BlockManager at the workers store the data > blocks and BlockManagerMaster stores the metadata for what blocks are > stored where. All the cached RDD's partitions and shuffle data are stored > and managed by the BlockManager. It also transfers the blocks between the > workers as needed (shuffles etc all happen through the block manager). > Specifically for spark streaming, the data received from outside is stored > in the BlockManager of the worker nodes, and the IDs of the blocks are > reported to the BlockManagerMaster. > > MapOutputTrackers is a simpler component that keeps track of the location > of the output of the map stage, so that workers running the reduce stage > knows which machines to pull the data from. That also has the master-worker > component - master has the full knowledge of the mapoutput and the worker > component on-demand pulls that knowledge from the master component when the > reduce tasks are executed on the worker. > > > >> >> 3. Can someone explain the usage of cache w.r.t spark streaming? For >> example if we do stream.cache(), will the cache remain constant with all >> the partitions of RDDs present across the nodes for that stream, OR will it >> be regularly updated as in while new batch is coming? >> >> If you call DStream.persist (persist == cache = true), then all RDDs > generated by the DStream will be persisted in the cache (in the > BlockManager). As new RDDs are generated and persisted, old RDDs from the > same DStream will fall out of memory. either by LRU or explicitly if > spark.streaming.unpersist is set to true. > > >> Thanks, >> -- >> >> Sourav Chandra >> >> Senior Software Engineer >> >> · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · >> >> sourav.chan...@livestream.com >> >> o: +91 80 4121 8723 >> >> m: +91 988 699 3746 >> >> skype: sourav.chandra >> >> Livestream >> >> "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd >> Block, Koramangala Industrial Area, >> >> Bangalore 560034 >> >> www.livestream.com >> > > -- Sourav Chandra Senior Software Engineer · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · · sourav.chan...@livestream.com o: +91 80 4121 8723 m: +91 988 699 3746 skype: sourav.chandra Livestream "Ajmera Summit", First Floor, #3/D, 68 Ward, 3rd Cross, 7th C Main, 3rd Block, Koramangala Industrial Area, Bangalore 560034 www.livestream.com