You don't need to know RDD dependencies to maximize parallelism. Internally
the scheduler will construct the DAG and trigger the execution if there are no
shuffle dependencies between the RDDs.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote:
Let's say I'm
What confused me is the statement "The final result is that rdd1 is
calculated twice." Is this the expected behavior?
Thanks.
Zhan Zhang
On Feb 26, 2015, at 3:03 PM, Sean Owen so...@cloudera.com wrote:
To distill this a bit further, I don't think you actually want
I think I'm getting more confused the longer this thread goes. So
rdd1.dependencies provides the immediate parents of rdd1. For now I'm going to
walk my internal DAG from the root down and see where running the caching
of siblings concurrently gets me.
I still like your point, Sean, about trying to
Let's say I'm given 2 RDDs and told to store them in a sequence file and
they have the following dependency:
val rdd1 = sparkContext.sequenceFile(...).cache()
val rdd2 = rdd1.map(...)
How would I tell programmatically without being the one who built rdd1 and
rdd2 whether or not rdd2
I see the rdd.dependencies() function. Does that include ALL the
dependencies of an RDD? Is it safe to assume I can say
rdd2.dependencies.contains(rdd1)?
On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet cjno...@gmail.com wrote:
Let's say I'm given 2 RDDs and told to store them in a sequence file
No, it does not give you transitive dependencies. You'd have to walk the
tree of dependencies yourself, but that should just be a few lines.
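Something like the following few lines would do that walk (a sketch; the ancestors helper name is illustrative, not part of Spark's API):

import org.apache.spark.rdd.RDD

// Collect every transitive parent of an RDD by recursively following
// dependencies(). RDD lineage is a DAG, so the recursion terminates.
def ancestors(rdd: RDD[_]): Set[RDD[_]] = {
  val parents = rdd.dependencies.map(_.rdd).toSet
  parents ++ parents.flatMap(ancestors)
}

// "Does rdd2 depend on rdd1, directly or transitively?" is then just:
// ancestors(rdd2).contains(rdd1)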
On Thu, Feb 26, 2015 at 3:32 PM, Corey Nolet cjno...@gmail.com wrote:
I see the rdd.dependencies() function, does that include ALL the
dependencies of
To distill this a bit further, I don't think you actually want rdd2 to
wait on rdd1 in this case. What you want is for a request for
partition X to wait if partition X is already being calculated in a
persisted RDD. Otherwise the first partition of rdd2 waits on the
final partition of rdd1 even
In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to finish,
probably due to writing to HDFS. A workaround for this particular case may be
as follows:
val rdd1 = ... .cache()
val rdd2 = rdd1.map(...)
rdd1.count()
future { rdd1.saveAsHadoopFile(...) }
future { rdd2.saveAsHadoopFile(...) }
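Spelled out as runnable code, the workaround looks roughly like this (the paths, the pair type, and sc as an existing SparkContext are assumptions for illustration):

import org.apache.spark.SparkContext._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// rdd2 derives from the cached rdd1.
val rdd1 = sc.textFile("hdfs:///input").map(line => (line, 1L)).cache()
val rdd2 = rdd1.map { case (k, v) => (k, v + 1) }

// Force rdd1 once so its partitions land in the cache first.
rdd1.count()

// Both saves now run concurrently and read rdd1 from the cache
// instead of recomputing it in two simultaneous jobs.
val save1 = Future { rdd1.saveAsSequenceFile("hdfs:///out/rdd1") }
val save2 = Future { rdd2.saveAsSequenceFile("hdfs:///out/rdd2") }

In real code you would also block on both futures (e.g. with Await.result) before shutting down the context.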
Zhan,
I think it might be helpful to point out that I'm trying to run the RDDs in
different threads to maximize the amount of work that can be done
concurrently. Unfortunately, right now if I had something like this:
val rdd1 = ... .cache()
val rdd2 = rdd1.map(...)
future { rdd1.saveAsHadoopFile(...) }
future { rdd2.saveAsHadoopFile(...) }
The final result is that rdd1 is calculated twice.
What confused me is the statement "The final result is that rdd1 is
calculated twice." Is this the expected behavior?
To be perfectly honest, performing an action on a cached RDD in two
different threads and having them (at the partition level) block until the
parent partitions are cached would be the
Yeah, I believe Corey knows that much and is using foreachPartition(i
=> None) to materialize. The question is, how would you do this with
an arbitrary DAG? In this simple example we know what the answer is,
but he's trying to do it programmatically.
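For reference, that materialization trick is just an action that touches every partition without shipping results back to the driver, along the lines of:

// Force evaluation (and thus caching) of every partition;
// the function body is deliberately a no-op.
rdd1.foreachPartition(_ => ())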
On Thu, Feb 26, 2015 at 11:54 PM, Zhan Zhang
I think we already covered that in this thread. You get dependencies
from RDD.dependencies()
On Fri, Feb 27, 2015 at 12:31 AM, Zhan Zhang zzh...@hortonworks.com wrote:
Currently in Spark, it looks like there is no easy way to know the
dependencies up front; they are resolved at run time.
Thanks.
Zhan
The issue is that both RDDs are being evaluated at once. rdd1 is
cached, which means that as its partitions are evaluated, they are
persisted, and later requests for a partition hit the cached copy.
But we have two threads causing two jobs to evaluate partitions of
rdd1 at the same time. If
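A minimal sketch of that race (the sizes and the artificial delay are illustrative, chosen only to make the recomputation observable):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Make each partition slow enough that both jobs start before
// the cache is populated.
val rdd1 = sc.parallelize(1 to 1000, 8)
  .map { x => Thread.sleep(5); x }
  .cache()
val rdd2 = rdd1.map(_ + 1)

// Two jobs race: each may evaluate partitions of rdd1 itself
// if the other's results haven't reached the cache yet.
Future { rdd1.count() }
Future { rdd2.count() }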
bq. whether or not rdd1 is a cached rdd
RDD has a getStorageLevel method which returns the RDD's current storage
level.
SparkContext has this method:
/**
 * Return information about what RDDs are cached, if they are in mem or
 * on disk, how much space they take, etc.
 */
@DeveloperApi
def getRDDStorageInfo: Array[RDDInfo]
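So a programmatic cache check can use public APIs (a sketch; an RDD that was never persisted reports StorageLevel.NONE):

import org.apache.spark.storage.StorageLevel

// True iff rdd1 has been marked for persistence.
val rdd1IsCached = rdd1.getStorageLevel != StorageLevel.NONE

// Or inspect everything currently cached in the context:
sc.getRDDStorageInfo.foreach(println)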
I should probably mention that my example case is much oversimplified.
Let's say I've got a tree, a fairly complex one, where I begin a series of
jobs at the root which calculate a bunch of really really complex joins,
and as I move down the tree, I'm creating reports from the data that's
already
Zhan,
This is exactly what I'm trying to do except, as I mentioned in my first
message, I am being given rdd1 and rdd2 only and I don't necessarily know
at that point whether or not rdd1 is a cached rdd. Further, I don't know at
that point whether or not rdd2 depends on rdd1.
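Combining the pieces from earlier in the thread, the check Corey describes could be sketched as follows (ancestors is the hypothetical dependency-walking helper above; none of this is a built-in Spark facility):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Given only rdd1 and rdd2, decide whether to materialize rdd1
// before launching both jobs concurrently.
def shouldMaterializeFirst(rdd1: RDD[_], rdd2: RDD[_]): Boolean =
  rdd1.getStorageLevel != StorageLevel.NONE &&
    ancestors(rdd2).contains(rdd1)

// if (shouldMaterializeFirst(rdd1, rdd2)) rdd1.count()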
On Thu, Feb 26,
Ted. That one I know. It was the dependency part I was curious about
On Feb 26, 2015 7:12 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. whether or not rdd1 is a cached rdd
RDD has a getStorageLevel method which returns the RDD's current
storage level.
SparkContext has this method:
*
Currently in Spark, it looks like there is no easy way to know the
dependencies up front; they are resolved at run time.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 4:20 PM, Corey Nolet
cjno...@gmail.com wrote:
Ted. That one I know. It was the dependency part I was curious about
On Feb