Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread andy petrella
Heya TD, Thanks for the detailed answer! Much appreciated. Regarding order among elements within an RDD, you're definitely right, it'd kill the parallelism and would require synchronization, which is completely avoided in a distributed env. That's why I won't push this constraint to the RDDs

Resource allocations

2014-07-16 Thread rapelly kartheek
Hi, I am trying to understand how resource allocation happens in Spark. I understand the resourceOffer method in taskScheduler. This method takes locality into account while allocating the resources. This resourceOffer method gets invoked by the corresponding cluster manager. I am working

Re: Resource allocations

2014-07-16 Thread Kay Ousterhout
Hi Karthik, The resourceOffer() method is invoked from a class implementing the SchedulerBackend interface; in the case of a standalone cluster, it's invoked from a CoarseGrainedSchedulerBackend (in the makeOffers() method). If you look in TaskSchedulerImpl.submitTasks(), it calls
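
The call flow described above (a backend gathers free resources and hands them to the task scheduler, which matches pending tasks to them with a preference for locality) can be sketched as a toy model. This is plain Python, not Spark code: the class names ToyBackend, ToyTaskScheduler, WorkerOffer and Task, and the two-pass matching rule, are assumptions made for illustration only, and Spark's real scheduler is considerably more involved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WorkerOffer:
    """Free resources on one executor (hypothetical, for illustration)."""
    executor_id: str
    host: str
    cores: int


@dataclass
class Task:
    """A pending task with an optional preferred host (hypothetical)."""
    task_id: int
    preferred_host: Optional[str] = None


class ToyTaskScheduler:
    """Toy stand-in for a task scheduler; NOT Spark's implementation."""

    def __init__(self) -> None:
        self.pending: List[Task] = []

    def submit_tasks(self, tasks: List[Task]) -> None:
        self.pending.extend(tasks)

    def resource_offers(self, offers: List[WorkerOffer]) -> List[Tuple[int, str]]:
        """Match pending tasks to offered cores, host-local tasks first."""
        free = {o.executor_id: o.cores for o in offers}
        assignments: List[Tuple[int, str]] = []
        # Pass 1 places only tasks whose preferred host matches the offer.
        # Pass 2 places whatever is left on any remaining free core.
        for local_only in (True, False):
            for offer in offers:
                for task in list(self.pending):
                    if free[offer.executor_id] == 0:
                        break
                    if local_only and task.preferred_host != offer.host:
                        continue
                    self.pending.remove(task)
                    assignments.append((task.task_id, offer.executor_id))
                    free[offer.executor_id] -= 1
        return assignments


class ToyBackend:
    """Toy stand-in for a scheduler backend that knows about executors."""

    def __init__(self, scheduler: ToyTaskScheduler, offers: List[WorkerOffer]) -> None:
        self.scheduler = scheduler
        self.offers = offers

    def make_offers(self) -> List[Tuple[int, str]]:
        # The backend, not the scheduler, initiates the offer round.
        return self.scheduler.resource_offers(self.offers)


def demo() -> List[Tuple[int, str]]:
    scheduler = ToyTaskScheduler()
    scheduler.submit_tasks([Task(0, "hostB"), Task(1, "hostA"), Task(2)])
    backend = ToyBackend(
        scheduler,
        [WorkerOffer("exec-1", "hostA", 1), WorkerOffer("exec-2", "hostB", 2)],
    )
    return backend.make_offers()
```

Calling demo() returns a list of (task_id, executor_id) pairs. The point of the sketch is only the direction of the call: the backend drives the offer round, and the scheduler decides which task goes where.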

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread Tathagata Das
I think it makes sense, though without a concrete implementation it's hard to be sure. Applying sorting on the RDD according to the RDDs makes sense, but I can think of two kinds of fundamental problems. 1. How do you deal with ordering across RDD boundaries? Say two consecutive RDDs in the

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread andy petrella
Indeed, these two cases are tightly coupled (the first one is a special case of the second). Actually, these outliers could be handled by a dedicated function that I named outliersManager -- I was not very inspired ^^ -- but we could name these outliers outlaws, and thus the function would be

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
Hi Sandy, we do have some issues with this. The difference is between Yarn-Alpha and Yarn-Stable (I noticed that in the latest build the module names have changed: yarn-alpha -> yarn, yarn -> yarn-stable). For example, in MRJobConfig.class, the field: DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH

Does RDD checkpointing store the entire state in HDFS?

2014-07-16 Thread Yan Fang
Hi guys, I am wondering how RDD checkpointing (https://spark.apache.org/docs/latest/streaming-programming-guide.html#RDD Checkpointing) works in Spark Streaming. When I use updateStateByKey, does Spark store the entire state (at one point in time) in HDFS or only put the transformation

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
Hmm, looks like a build script issue: I ran the command sbt/sbt clean *yarn/*test:compile but the errors came from [error] 40 errors found [error] (*yarn-stable*/compile:compile) Compilation failed Chester On Wed, Jul 16, 2014 at 5:18 PM, Chester Chen ches...@alpinenow.com wrote: Hi,

Re: Does RDD checkpointing store the entire state in HDFS?

2014-07-16 Thread Tathagata Das
After every checkpointing interval, the latest state RDD is stored to HDFS in its entirety. Along with that, the series of DStream transformations that was set up with the streaming context is also stored into HDFS (the whole DAG of DStream objects is serialized and saved). TD On Wed, Jul 16,
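
The "entire state, not a delta" behaviour described above can be illustrated with a small toy model. This is plain Python and not Spark: the function names update_state and run_with_checkpoints, the JSON file layout, and the use of a local directory in place of HDFS are all assumptions made for the sketch. It also leaves out the second part of the answer, the serialized DStream graph.

```python
import json
import os
from typing import Dict, List, Tuple

Batch = List[Tuple[str, int]]


def update_state(state: Dict[str, int], batch: Batch) -> Dict[str, int]:
    """Fold one batch of (key, value) pairs into a copy of the running state."""
    new_state = dict(state)
    for key, value in batch:
        new_state[key] = new_state.get(key, 0) + value
    return new_state


def run_with_checkpoints(
    batches: List[Batch], checkpoint_dir: str, checkpoint_every: int = 2
) -> Tuple[Dict[str, int], List[str]]:
    """Process batches in order, writing the FULL state every N batches."""
    state: Dict[str, int] = {}
    written: List[str] = []
    for i, batch in enumerate(batches, start=1):
        state = update_state(state, batch)
        if i % checkpoint_every == 0:
            # The whole state is written, including keys that the most
            # recent batches did not touch.
            path = os.path.join(checkpoint_dir, f"state-{i}.json")
            with open(path, "w") as f:
                json.dump(state, f)
            written.append(path)
    return state, written
```

In this model, a key that was last updated several batches ago still appears in every later checkpoint file, which is why the checkpoint size tracks the total number of keys rather than the size of recent batches.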

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
Looking further, yarn and yarn-stable are both for the stable version of Yarn, which explains the compilation errors when using the 2.0.5-alpha version of Hadoop. The module yarn-alpha (although it is still in SparkBuild.scala) is no longer there in the sbt console. projects [info] In

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Matei Zaharia
Hey Reynold, just to clarify, users will still have to manually broadcast objects that they want to use *across* operations (e.g. in multiple iterations of an algorithm, or multiple map functions, or stuff like that). But they won't have to broadcast something they only use once. Matei On Jul

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Reynold Xin
Yup - that is correct. Thanks for clarifying. On Wed, Jul 16, 2014 at 10:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Reynold, just to clarify, users will still have to manually broadcast objects that they want to use *across* operations (e.g. in multiple iterations of an

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Stephen Haberman
Wow. Great writeup. I keep tabs on several open source projects that we use heavily, and I'd be ecstatic if more major changes were this well/succinctly explained instead of the usual "just read the commit message/diff". - Stephen

Re: Hadoop's Configuration object isn't threadsafe

2014-07-16 Thread Andrew Ash
Hi Patrick, thanks for taking a look. I filed as https://issues.apache.org/jira/browse/SPARK-2546 Would you recommend I pursue the cloned Configuration object approach now and send in a PR? Reynold's recent announcement of the broadcast RDD object patch may also have implications of the right