These are very interesting comments. The vast majority of cases I'm working on 
are in the 3 million range; 100 million was thrown out as something to shoot 
for, and I upped it to 500 million. But all things considered, I believe I may 
be able to directly translate what I have to the Java 8 Streams API and run 100 
million docs on 32 cores in under an hour or two, which would suit our needs. 
Up until this point I've been focused on the computational aspect. If I can 
scale up to clustering 100 million documents on a single machine, that direct 
translation to Java Streams would probably be faster; it is scaling out that 
changes things.

I think in this hierarchical k-means case the lazy evaluation becomes almost 
useless and perhaps even an impediment. Part of the problem is that I've been a 
bit too focused on math/information retrieval and need to brush up on the 
functional approach to programming so I can better utilize the tools. It does 
appear, though, that Spark may not be the best option for this need. I don't 
need resiliency or fault tolerance as much as I need to execute an algorithm on 
a large amount of data quickly and then be done with it. In the 100 million 
document range, I may be OK clustering feature vectors with no more than 25 
features per doc on a single machine with 32 cores and a load of memory, using 
the Java 8 Streams API.
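
For concreteness, here is a rough sketch of what that single-machine assignment 
step could look like, written with Scala parallel collections as a stand-in for 
Java 8 parallel streams. The dense-vector representation and all names are 
invented for illustration, not taken from my actual code.

// Per-iteration assignment step on one machine, parallelized across cores.
// Assumes dense feature vectors (~25 doubles per doc); names are hypothetical.
object SingleMachineAssign {

  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // For every document vector, find the index of the nearest centroid.
  def assign(vectors: Array[Array[Double]],
             centroids: Array[Array[Double]]): Array[Int] =
    vectors.par
      .map(v => centroids.indices.minBy(i => squaredDistance(v, centroids(i))))
      .toArray
}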
There are also questions of proportion. Perhaps what I have is not big enough 
to warrant or require scaling out. I may have other uses for Spark in 
traditional map-reduce algorithms, such as counting pairs of shared shingles 
for near-duplicate detection, but to this point I've found that Oracle's 
parallel pipelined table functions, while not glamorous, are doing quite well 
in the database.
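
If that shingle job ever did move to Spark, it has the classic pair-counting 
map-reduce shape. A hedged sketch, with the input layout and all names assumed 
for illustration (on the Spark versions of this era you also need 
import org.apache.spark.SparkContext._ for the pair-RDD implicits):

import org.apache.spark.rdd.RDD

object SharedShingles {
  // Input: (shingleHash, docId) pairs. Output: ((docA, docB), sharedShingleCount).
  def sharedShingleCounts(shingles: RDD[(Long, Long)]): RDD[((Long, Long), Int)] =
    shingles
      .groupByKey()                             // all docs containing a given shingle
      .flatMap { case (_, docs) =>
        val ds = docs.toArray.distinct.sorted   // canonical order so (a, b) == (b, a)
        for {
          i <- ds.indices.iterator
          j <- (i + 1) until ds.length
        } yield ((ds(i), ds(j)), 1)             // emit each co-occurring doc pair once
      }
      .reduceByKey(_ + _)                       // count shared shingles per doc pair
}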
I'm still a bit confused about why Spark is advertised as ideal for iterative 
algorithms, when iterative algorithms have that point per iteration where 
things do get evaluated and laziness is not terribly useful. Ideal for massive 
in-memory cluster computing, yes; but iterative? I'm not sure. I have the book 
"Functional Programming in Scala" and I hope to read it someday and enrich my 
understanding here.

Subject: Re: Modifying an RDD in forEach
From: mohitja...@gmail.com
Date: Sat, 6 Dec 2014 13:13:50 -0800
CC: ronalday...@live.com; user@spark.apache.org
To: mayur.rust...@gmail.com

Ron,

"Appears to be working" might be true when there are no failures. On large 
datasets being processed on a large number of machines, failures of several 
types (server, network, disk, etc.) can happen. At that point, Spark will not 
"know" that you changed the RDD in place and may use any version of any 
partition of the RDD when retrying tasks. Retries require idempotency, and that 
is difficult without immutability. I believe this is one of the primary reasons 
for making RDDs immutable in Spark (mutation isn't even an option worth 
considering). In general, mutating something in a distributed system is a hard 
problem. It can be solved (e.g., in NoSQL or NewSQL databases), but Spark is 
not a transactional data store.
If you are building an iterative machine learning algorithm, which usually has 
a "reduce" step at the end of every iteration, then the lazy evaluation is 
unlikely to be useful. On the other hand, if these intermediate RDDs stay in 
the young generation of the JVM heap [I am not sure if RDD cache management 
somehow changes this, so I could be wrong], they are garbage collected quickly 
and with very little overhead.
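
To make that concrete, a minimal sketch of that shape of iteration (a 
k-means-style update; every type and name here is invented, not Ron's code): 
each pass ends in an action, so laziness does not stretch across iterations, 
and it is cache() on the input that does the real work of keeping data in 
memory.

import org.apache.spark.rdd.RDD
// On Spark 1.x, also: import org.apache.spark.SparkContext._ for the pair-RDD implicits.

object IterativeSketch {
  type Vec = Array[Double]

  private def nearest(v: Vec, centroids: Array[Vec]): Int =
    centroids.indices.minBy { i =>
      var s = 0.0; var j = 0
      while (j < v.length) { val d = v(j) - centroids(i)(j); s += d * d; j += 1 }
      s
    }

  private def add(a: Vec, b: Vec): Vec =
    a.zip(b).map { case (x, y) => x + y }

  def run(data: RDD[Vec], initial: Array[Vec], iterations: Int): Array[Vec] = {
    data.cache()                            // reused every pass; keep it in memory
    var centroids = initial
    for (_ <- 1 to iterations) {
      val updated = data
        .map(v => (nearest(v, centroids), (v, 1L)))                          // lazy
        .reduceByKey { case ((s1, n1), (s2, n2)) => (add(s1, s2), n1 + n2) } // lazy
        .mapValues { case (sum, n) => sum.map(_ / n) }
        .collectAsMap()                     // the per-iteration "reduce": forces evaluation
      centroids = centroids.indices
        .map(i => updated.getOrElse(i, centroids(i)))
        .toArray
    }
    centroids
  }
}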
This is the price of scaling out :-)

Hope this helps,
Mohit
On Dec 6, 2014, at 5:02 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

You'll benefit by viewing Matei's talk at Yahoo on Spark internals and how it 
optimizes the execution of iterative jobs.

Simple answer is 

1. Spark doesn't materialize the RDD when you do an iteration, but lazily 
captures the transformation functions in the RDD (only the function and its 
closure are recorded; no data operation actually happens).

2. When you finally execute and want to cause effects (save to disk, collect on 
the master, etc.), it views the DAG of execution and optimizes where it can 
(eliminating intermediate states, performing multiple transformations in one 
task, leveraging partitioning where available, among others).

Bottom line: it doesn't matter how many RDDs you have in your DAG chain, as 
long as Spark can optimize the functions in that DAG to create minimal 
materialization on its way to the final output.
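
A small illustration of points 1 and 2 (the input path and names are made up): 
nothing below reads any data until the final action, and Spark pipelines the 
chained transformations into a single stage over each partition.

import org.apache.spark.{SparkConf, SparkContext}

object LazyChain {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-chain"))

    val lines    = sc.textFile("hdfs:///some/input")  // nothing is read yet
    val tokens   = lines.flatMap(_.split("\\s+"))     // just recorded in the lineage
    val filtered = tokens.filter(_.nonEmpty)          // still no work done
    val lengths  = filtered.map(_.length)

    // Only here does Spark build the DAG, fuse the three transformations above
    // into one set of tasks, and run them; no intermediate RDD is materialized.
    val total = lengths.reduce(_ + _)
    println(s"total characters: $total")

    sc.stop()
  }
}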
Regards

Mayur


On 06-Dec-2014 6:12 pm, "Ron Ayoub" <ronalday...@live.com> wrote:



This is from a separate thread with a differently named title. 
Why can't you modify the actual contents of an RDD using forEach? It appears to 
be working for me. What I'm doing is changing cluster assignments and distances 
per data item on each iteration of the clustering algorithm. The clustering 
algorithm is massive and iterates thousands of times. As I understand it now, 
you are supposed to create new RDDs on each pass. This is a hierarchical 
k-means that I'm doing, and hence it consists of many small iterations rather 
than a few large ones.
So I understand the restriction on why operations, when aggregating and 
reducing etc., need to be associative. However, forEach operates on a single 
item. So, given that Spark is advertised as being great for iterative 
algorithms since it operates in memory, how can it be good to create thousands 
upon thousands of RDDs during the course of an iterative algorithm? Does Spark 
have some kind of trick, like reuse behind the scenes, fully persistent data 
structures, or whatever? How can it possibly be efficient for 'iterative' 
algorithms when it is creating so many RDDs as opposed to one?
Or is the answer that I should keep doing what I'm doing because it is working, 
even though it is not theoretically sound and aligned with functional ideas? I 
personally just want it to be fast and to be able to operate on up to 500 
million data items.
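
For what it's worth, a hedged sketch of the two approaches under discussion, 
with every type and name invented for illustration: the forEach version mutates 
executor-local copies in place (the part that only "appears to work"), while 
the map version derives a new, cheap, lazy RDD of assignments on each pass.

import org.apache.spark.rdd.RDD

object AssignmentStep {
  case class Doc(id: Long, features: Array[Double],
                 var cluster: Int = -1, var distance: Double = Double.MaxValue)

  private def nearest(features: Array[Double],
                      centroids: Array[Array[Double]]): (Int, Double) = {
    val dists = centroids.map { c =>
      features.zip(c).map { case (x, y) => (x - y) * (x - y) }.sum
    }
    val best = dists.indices.minBy(i => dists(i))
    (best, dists(best))
  }

  // What the question describes: mutating each element in place with foreach.
  // The writes land on whichever executor holds the partition and are silently
  // discarded or recomputed differently if a task is retried.
  def mutateInPlace(docs: RDD[Doc], centroids: Array[Array[Double]]): Unit =
    docs.foreach { d =>
      val (c, dist) = nearest(d.features, centroids)
      d.cluster = c
      d.distance = dist
    }

  // The immutable version: each pass yields a new RDD of (docId, cluster)
  // assignments. The RDD itself is just lineage metadata until an action runs,
  // so "thousands of RDDs" means thousands of small bookkeeping objects, not
  // thousands of copies of the data.
  def assign(docs: RDD[Doc], centroids: Array[Array[Double]]): RDD[(Long, Int)] =
    docs.map { d =>
      val (c, _) = nearest(d.features, centroids)
      (d.id, c)
    }
}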