“Ideal for iterative workloads” is a comparison to Hadoop MapReduce. If you 
are happy with a single machine, by all means, do that.

Scaling out may be useful when:
1) You want to finish the task faster by using more machines. This may not 
involve any additional cost if you are using utility computing like AWS: e.g. 
if the cost of 1 machine for 1 hour is the same as the cost of 60 machines for 
a minute, you get your results 60 times faster for the same money.
2) You may have larger data. At some point you will run out of “vertical 
scaling” options, or they will become prohibitively expensive [e.g. you had 
everything working for 100 million docs but then you got 10 more docs. Do you 
now buy and install more DIMMs in your server?]
3) You are using utility computing like AWS and there is a cliff drop in 
pricing for smaller machines [this is common].
4) What if you want to modify your algorithm and now it needs a few more bytes 
of memory? Go to the store, buy DIMMs, and install them in your server?

> On Dec 6, 2014, at 1:42 PM, Ron Ayoub <ronalday...@live.com> wrote:
> 
> These are very interesting comments. The vast majority of cases I'm working 
> on are going to be in the 3 million range, and 100 million was thrown out as 
> something to shoot for. I upped it to 500 million. But all things considered, 
> I believe I may be able to directly translate what I have to the Java Streams 
> API and run 100 million docs on 32 cores in under an hour or two, which would 
> suit our needs. Up until this point I've been focused on the computational 
> aspect.
> 
> If I can scale up to clustering 100 million documents on a single machine, I 
> can probably directly translate what I have to the Java Streams API and be 
> faster. It is scaling out that changes things. I think in this hierarchical 
> k-means case the lazy evaluation becomes almost useless and perhaps even an 
> impediment. Part of the problem is that I've been a bit too focused on 
> math/information retrieval and have to catch up a bit on the functional 
> approach to programming so I can better utilize the tools. But it does appear 
> that Spark may not be the best option for this need. I don't need resiliency 
> or fault tolerance as much as I need to be able to execute an algorithm on a 
> large amount of data fast and then be done with it. I'm now thinking that in 
> the 100 million document range I may be OK clustering feature vectors with no 
> more than 25 features per doc on a single machine with 32 cores and a load of 
> memory. I might directly translate what I have to the Java 8 Streams API.
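>
> Roughly, the single-machine version is just a parallel map over the document 
> vectors to find each one's nearest centroid. A minimal sketch of that shape 
> (Scala parallel collections here as a stand-in for a Java 8 parallel stream; 
> all names are placeholders):
>
>   // Sketch only: one machine, all cores, nearest-centroid assignment.
>   // On Scala 2.13+ the .par conversion needs this import; on 2.12 and
>   // earlier it is built into the standard library.
>   import scala.collection.parallel.CollectionConverters._
>
>   object LocalAssignSketch {
>     type Vec = Array[Double]
>
>     def sqDist(a: Vec, b: Vec): Double = {
>       var s = 0.0; var i = 0
>       while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
>       s
>     }
>
>     // Parallel over documents, sequential over the (small) set of centroids.
>     def assign(docs: Array[Vec], centroids: Array[Vec]): Array[Int] =
>       docs.par
>         .map(v => centroids.indices.minBy(c => sqDist(v, centroids(c))))
>         .toArray
>   }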
> 
> There are also questions of proportion. Perhaps what I have is not big enough 
> to warrant or require scaling out. I may have other uses for Spark in 
> traditional map-reduce algorithms, such as counting pairs of shared shingles 
> for near-dupe detection, but to this point I've found Oracle's 
> parallel-pipelined table functions, while not glamorous, are doing quite well 
> in the DB.
> 
> I'm still just a bit confused about why it is advertised as ideal for 
> iterative algorithms, when iterative algorithms have that point per iteration 
> where things do get evaluated and laziness is not terribly useful. Ideal for 
> massive in-memory cluster computing, yes - but iterative...? Not sure. I have 
> the book "Functional Programming in Scala" and I hope to read it someday and 
> enrich my understanding here.
> 
> Subject: Re: Modifying an RDD in forEach
> From: mohitja...@gmail.com
> Date: Sat, 6 Dec 2014 13:13:50 -0800
> CC: ronalday...@live.com; user@spark.apache.org
> To: mayur.rust...@gmail.com
> 
> Ron,
> “appears to be working” might be true when there are no failures. On large 
> datasets being processed on a large number of machines, failures of several 
> types (server, network, disk, etc.) can happen. At that point, Spark will not 
> “know” that you changed the RDD in place, and will use any version of any 
> partition of the RDD when retrying. Retries require idempotency, and that is 
> difficult without immutability. I believe this is one of the primary reasons 
> for making RDDs immutable in Spark (mutable isn't even an option worth 
> considering). In general, mutating something in a distributed system is a 
> hard problem. It can be solved (e.g. in NoSQL or NewSQL databases), but Spark 
> is not a transactional data store.
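>
> The idiomatic shape is to derive a new RDD rather than mutate records in 
> place. A minimal sketch (the types and names are made up for illustration):
>
>   import org.apache.spark.rdd.RDD
>
>   object ReassignSketch {
>     case class Doc(id: Long, features: Array[Double], cluster: Int)
>
>     def sqDist(a: Array[Double], b: Array[Double]): Double = {
>       var s = 0.0; var i = 0
>       while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
>       s
>     }
>
>     // map builds a new RDD; a retried partition is recomputed
>     // deterministically from its parent instead of depending on a side
>     // effect that may or may not have happened before the failure.
>     def reassign(docs: RDD[Doc], centroids: Array[Array[Double]]): RDD[Doc] =
>       docs.map { d =>
>         val nearest = centroids.indices.minBy(c => sqDist(d.features, centroids(c)))
>         d.copy(cluster = nearest)
>       }
>   }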
> 
> If you are building an iterative machine learning algorithm, which usually 
> has a “reduce” step at the end of every iteration, then the lazy evaluation 
> is unlikely to be useful. On the other hand, if these intermediate RDDs stay 
> in the young generation of the JVM heap [I am not sure if RDD cache 
> management somehow changes this, so I could be wrong], they are garbage 
> collected quickly and with very little overhead.
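>
> To make that iteration shape concrete, here is a rough sketch of such a loop 
> (helper names and types are my own invention, not anyone's actual code): each 
> pass builds a new RDD lazily, and the collect at the end of the pass forces 
> evaluation before the next pass starts, so laziness buys little across 
> iterations.
>
>   import org.apache.spark.rdd.RDD
>
>   object KMeansLoopSketch {
>     def nearest(p: Array[Double], cs: Array[Array[Double]]): Int =
>       cs.indices.minBy { c =>
>         var s = 0.0; var j = 0
>         while (j < p.length) { val d = p(j) - cs(c)(j); s += d * d; j += 1 }
>         s
>       }
>
>     def run(points: RDD[Array[Double]],
>             init: Array[Array[Double]],
>             maxIter: Int): Array[Array[Double]] = {
>       points.cache()                                    // reused by every pass
>       var centroids = init
>       for (_ <- 0 until maxIter) {
>         val totals = points
>           .map(p => (nearest(p, centroids), (p, 1L)))   // lazy transformation
>           .reduceByKey { case ((a, na), (b, nb)) =>
>             (a.zip(b).map { case (x, y) => x + y }, na + nb)
>           }
>           .collectAsMap()                               // action: forces this pass
>         centroids = centroids.indices.map { k =>
>           totals.get(k)
>             .map { case (sum, n) => sum.map(_ / n) }
>             .getOrElse(centroids(k))                    // keep empty clusters as-is
>         }.toArray
>       }
>       centroids
>     }
>   }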
> 
> This is the price of scaling out :-)
>       
> Hope this helps,
> Mohit.
> 
> On Dec 6, 2014, at 5:02 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
> 
> You'll benefit from viewing Matei's talk at Yahoo on Spark internals and how 
> it optimizes the execution of iterative jobs.
> The simple answer is:
> 1. Spark doesn't materialize an RDD when you do an iteration, but lazily 
> captures the transformation functions in the RDD (only the function and 
> closure are recorded; no data operation actually happens).
> 2. When you finally execute and want to cause effects (save to disk, collect 
> on the master, etc.), it views the DAG of execution and optimizes what it can 
> reason about (eliminating intermediate state, performing multiple 
> transformations in one task, leveraging partitioning where available, among 
> others).
> Bottom line: it doesn't matter how many RDDs you have in your DAG chain, as 
> long as Spark can optimize the functions in that DAG to create minimal 
> materialization on its way to the final output.
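>
> A tiny illustration (the paths and parsing are placeholders): every line 
> below except the last only records lineage; the single action at the end 
> triggers one job in which the maps and the filter are pipelined into the same 
> tasks, with no intermediate RDD materialized.
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   object LazyDagSketch {
>     def main(args: Array[String]): Unit = {
>       val sc = new SparkContext(new SparkConf().setAppName("lazy-dag-sketch"))
>
>       val stats = sc.textFile("hdfs:///some/input/path")    // lazy: no I/O yet
>         .map(_.split(',').map(_.toDouble))                  // lazy transformation
>         .filter(_.nonEmpty)                                 // lazy transformation
>         .map(v => (v.min, v.max, v.sum / v.length))         // lazy transformation
>
>       stats.saveAsTextFile("hdfs:///some/output/path")      // action: one pipelined job
>       sc.stop()
>     }
>   }
>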
> Regards
> Mayur
> On 06-Dec-2014 6:12 pm, "Ron Ayoub" <ronalday...@live.com> wrote:
> This is from a separate thread with a different title. 
> 
> Why can't you modify the actual contents of an RDD using forEach? It appears 
> to be working for me. What I'm doing is changing cluster assignments and 
> distances per data item for each iteration of the clustering algorithm. The 
> clustering algorithm is massive and iterates thousands of times. As I 
> understand it now, you are supposed to create new RDDs on each pass. This is 
> a hierarchical k-means that I'm doing, and hence it consists of many small 
> iterations rather than a few large ones.
> 
> So I understand the restriction that operations, when aggregating and 
> reducing, etc., need to be associative. However, forEach operates on a single 
> item. So given that Spark is advertised as being great for iterative 
> algorithms since it operates in-memory, how can it be good to create 
> thousands upon thousands of RDDs during the course of an iterative algorithm? 
> Does Spark have some kind of trick, like reuse behind the scenes - fully 
> persistent data objects or whatever? How can it possibly be efficient for 
> 'iterative' algorithms when it is creating so many RDDs as opposed to one? 
> 
> Or is the answer that I should keep doing what I'm doing, because it is 
> working, even though it is not theoretically sound or aligned with functional 
> ideas? I personally just want it to be fast and be able to operate on up to 
> 500 million data items. 
