I think I'm getting more confused the longer this thread goes. So
rdd1.dependencies provides the immediate parents of rdd1. For now I'm going to
walk my internal DAG from the root down and see where running the caching
of siblings concurrently gets me.
I still like your point, Sean, about trying to do
I think we already covered that in this thread. You get dependencies
from RDD.dependencies()
On Fri, Feb 27, 2015 at 12:31 AM, Zhan Zhang wrote:
> Currently in Spark, it looks like there is no easy way to know the
> dependencies. It is resolved at run time.
>
> Thanks.
>
> Zhan Zhang
>
> On Feb 26,
Currently in Spark, it looks like there is no easy way to know the
dependencies. It is resolved at run time.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 4:20 PM, Corey Nolet <cjno...@gmail.com> wrote:
Ted. That one I know. It was the dependency part I was curious about
On Feb 26, 2015 7:12 P
I missed that part. Thanks for the explanation. It is a challenging problem
implementation-wise. To do it programmatically:
1. pre-analyze all DAGs to form a complete DAG with root as the source, and
leaf as all actions.
2. Any RDD (node) that has more than one downstream node needs to be marked
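Step 2 can be sketched on a toy edge list rather than real RDDs (the names `edges` and `cacheCandidates` here are hypothetical illustrations, not Spark API):

```scala
// Hypothetical sketch of step 2 on a toy DAG given as (parent -> child)
// edges, i.e. the child RDD depends on the parent RDD.
val edges = List("src" -> "joinA", "src" -> "joinB", "joinA" -> "report")

// A parent that appears as the source of two or more edges has more than
// one downstream node, so its result is reused and worth caching.
val cacheCandidates: Set[String] = edges
  .groupBy { case (parent, _) => parent }
  .collect { case (parent, outs) if outs.size > 1 => parent }
  .toSet

println(cacheCandidates) // Set(src)
```

With real RDDs the edge list would come from walking each action's lineage via rdd.dependencies.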
Ted. That one I know. It was the dependency part I was curious about
On Feb 26, 2015 7:12 PM, "Ted Yu" wrote:
> bq. whether or not rdd1 is a cached rdd
>
> RDD has getStorageLevel method which would return the RDD's current
> storage level.
>
> SparkContext has this method:
>* Return informat
bq. whether or not rdd1 is a cached rdd
RDD has getStorageLevel method which would return the RDD's current storage
level.
SparkContext has this method:
 * Return information about what RDDs are cached, if they are in mem or
 * on disk, how much space they take, etc.
 */
@DeveloperApi
d
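Put together, the two checks Ted mentions would look roughly like this (a sketch against an already-existing SparkContext `sc` and RDD `rdd1`, so it is not self-contained):

```scala
import org.apache.spark.storage.StorageLevel

// Was persist()/cache() ever called on rdd1?
val markedForCaching = rdd1.getStorageLevel != StorageLevel.NONE

// Does rdd1 actually hold cached partitions right now?
// (getRDDStorageInfo is a DeveloperApi on SparkContext.)
val materialized = sc.getRDDStorageInfo.exists(_.id == rdd1.id)
```

Note the two can disagree: getStorageLevel reflects intent, while getRDDStorageInfo reflects what has actually been materialized.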
Yeah, I believe Corey knows that much and is using foreachPartition(i
=> None) to materialize. The question is, how would you do this with
an arbitrary DAG? In this simple example we know what the answer is,
but he's trying to do it programmatically.
On Thu, Feb 26, 2015 at 11:54 PM, Zhan Zhang wr
Zhan,
This is exactly what I'm trying to do except, as I mentioned in my first
message, I am being given rdd1 and rdd2 only and I don't necessarily know
at that point whether or not rdd1 is a cached rdd. Further, I don't know at
that point whether or not rdd2 depends on rdd1.
On Thu, Feb 26, 2015
In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to finish,
probably due to writing to HDFS. A workaround for this particular case may be
as follows.
val rdd1 = ..cache()
val rdd2 = rdd1.map().()
rdd1.count
future { rdd1.saveAsHadoopFile(...) }
future { rdd2.saveAsHadoo
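The shape of this workaround can be shown with plain Scala futures; the `countRdd1`/`saveRdd1`/`saveRdd2` functions below are stand-in stubs for the Spark actions (rdd1.count, the two saveAsHadoopFile calls), not Spark API, so the snippet runs on its own:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-ins for the Spark actions in Zhan's workaround.
def countRdd1(): Long  = { Thread.sleep(10); 100L } // stands in for rdd1.count
def saveRdd1(): String = "rdd1 saved"               // stands in for rdd1.saveAsHadoopFile(...)
def saveRdd2(): String = "rdd2 saved"               // stands in for rdd2.saveAsHadoopFile(...)

countRdd1()                        // force rdd1 into the cache first,
val f1 = Future { saveRdd1() }     // then both saves run concurrently;
val f2 = Future { saveRdd2() }     // rdd2 reuses rdd1's cached partitions
val results = Await.result(Future.sequence(List(f1, f2)), 10.seconds)
println(results)
```

The point of the rdd1.count step is that by the time the two futures start, rdd1's partitions are already cached, so neither job recomputes them.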
> What confused me is the statement "The final result is that rdd1 is
> calculated twice." Is it the expected behavior?
To be perfectly honest, performing an action on a cached RDD in two
different threads and having them (at the partition level) block until the
parents are cached would be the
I should probably mention that my example case is much oversimplified.
Let's say I've got a tree, a fairly complex one where I begin a series of
jobs at the root which calculates a bunch of really really complex joins
and as I move down the tree, I'm creating reports from the data that's
already b
The issue is that both RDDs are being evaluated at once. rdd1 is
cached, which means that as its partitions are evaluated, they are
persisted. Later requests for the partition hit the cached partition.
But we have two threads causing two jobs to evaluate partitions of
rdd1 at the same time. If they
What confused me is the statement "The final result is that rdd1 is
calculated twice." Is it the expected behavior?
Thanks.
Zhan Zhang
On Feb 26, 2015, at 3:03 PM, Sean Owen <so...@cloudera.com> wrote:
To distill this a bit further, I don't think you actually want rdd2 to
wait on r
To distill this a bit further, I don't think you actually want rdd2 to
wait on rdd1 in this case. What you want is for a request for
partition X to wait if partition X is already being calculated in a
persisted RDD. Otherwise the first partition of rdd2 waits on the
final partition of rdd1 even whe
Zhan,
I think it might be helpful to point out that I'm trying to run the RDDs in
different threads to maximize the amount of work that can be done
concurrently. Unfortunately, right now if I had something like this:
val rdd1 = ..cache()
val rdd2 = rdd1.map().()
future { rdd1.saveAsHaso
You don't need to know RDD dependencies to maximize concurrency. Internally
the scheduler will construct the DAG and trigger the execution if there are no
shuffle dependencies in between RDDs.
Thanks.
Zhan Zhang
On Feb 26, 2015, at 1:28 PM, Corey Nolet wrote:
> Let's say I'm given 2 RDDs and
No, it does not give you transitive dependencies. You'd have to walk the
tree of dependencies yourself, but that should just be a few lines.
On Thu, Feb 26, 2015 at 3:32 PM, Corey Nolet wrote:
> I see the "rdd.dependencies()" function, does that include ALL the
> dependencies of an RDD? Is it s
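Sean's "few lines" walk could look something like this, sketched on a toy node type rather than real RDDs (`Node` and `transitiveDeps` are hypothetical names; against Spark you would follow rdd.dependencies and each Dependency's rdd field instead):

```scala
// Toy stand-in for an RDD: each node knows only its immediate parents,
// just as RDD.dependencies exposes only one level.
case class Node(id: String, parents: List[Node])

// Depth-first walk collecting every transitive ancestor of `root`.
def transitiveDeps(root: Node): Set[Node] = {
  def walk(frontier: List[Node], seen: Set[Node]): Set[Node] = frontier match {
    case Nil => seen
    case n :: rest =>
      if (seen(n)) walk(rest, seen)
      else walk(n.parents ::: rest, seen + n)
  }
  walk(root.parents, Set.empty)
}

val rdd1   = Node("rdd1", Nil)
val mapped = Node("mapped", List(rdd1))
val rdd2   = Node("rdd2", List(mapped))
println(transitiveDeps(rdd2).contains(rdd1)) // true
```

So "rdd2.dependencies.contains(rdd1)" alone would miss rdd1 here (it is two hops away); the walk is what makes the check transitive.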
I see the "rdd.dependencies()" function, does that include ALL the
dependencies of an RDD? Is it safe to assume I can say
"rdd2.dependencies.contains(rdd1)"?
On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet wrote:
> Let's say I'm given 2 RDDs and told to store them in a sequence file and
> they have
Let's say I'm given 2 RDDs and told to store them in a sequence file and
they have the following dependency:
val rdd1 = sparkContext.sequenceFile().cache()
val rdd2 = rdd1.map()
How would I tell programmatically without being the one who built rdd1 and
rdd2 whether or not rdd2 depend