Why does sortByKey() transformation trigger a job in spark-shell?
Hi Sparkians,

I'm using the latest Spark 1.6.0-SNAPSHOT in spark-shell with the default local[*] master.

I created an RDD of pairs using the following snippet:

    val rdd = sc.parallelize(0 to 5).map(n => (n, util.Random.nextBoolean))

It's all fine so far. The map transformation causes no computation.

I thought all transformations are lazy and trigger no job until an action is called. It seems I was wrong about sortByKey()! When I called `rdd.sortByKey()`, it started a job: sortByKey at <console>:27 (!)

Can anyone explain what makes for the different behaviour of sortByKey since it is a transformation and hence should be lazy? Is this a special transformation?

Regards,
Jacek

--
Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Why does sortByKey() transformation trigger a job in spark-shell?
Hi,

Answering my own question after searching for sortByKey in the mailing list archives and later in JIRA.

It turns out it's a known issue, filed under https://issues.apache.org/jira/browse/SPARK-1021 "sortByKey() launches a cluster job when it shouldn't".

It's labelled "starter", which suggests it should not be that hard to fix. Does this still hold? I'd like to work on it if it's "simple" and doesn't get me swamped. Thanks!

Regards,
Jacek

On Mon, Nov 2, 2015 at 2:34 PM, Jacek Laskowski wrote:
> Can anyone explain what makes for the different behaviour of sortByKey
> since it is a transformation and hence should be lazy? Is this a
> special transformation?
Re: Why does sortByKey() transformation trigger a job in spark-shell?
Hah! No, that is not a "starter" issue. It touches on some fairly deep Spark architecture, and there have already been a few attempts to resolve the issue -- none entirely satisfactory, but you should definitely search out the work that has already been done.

On Mon, Nov 2, 2015 at 5:51 AM, Jacek Laskowski wrote:
> It turns out it's a known issue and filed under
> https://issues.apache.org/jira/browse/SPARK-1021 "sortByKey() launches
> a cluster job when it shouldn't".
>
> Does this still hold? I'd like to work on it if it's "simple" and
> doesn't get me swamped. Thanks!
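For anyone landing on this thread later, here is a rough illustration of why the job appears. As far as I understand it, sortByKey builds a range partitioner, and a range partitioner needs boundary keys that can only be found by looking at the data, so it samples the RDD when it is constructed. The sketch below is a toy model in plain Scala, not Spark code: ToyDataset, ToyHashPartitioner and ToyRangePartitioner are hypothetical names made up for this example, and the "sampling" is simply a full sort of a tiny collection.

```scala
// Toy model only, NOT Spark code: every class name here is hypothetical.

// A lazily evaluated dataset that counts how often it is actually computed.
final class ToyDataset[K](compute: () => Seq[K]) {
  var jobsRun = 0
  def collect(): Seq[K] = { jobsRun += 1; compute() }
}

// Hash partitioning needs only the key itself, so construction touches no data.
final class ToyHashPartitioner(val numPartitions: Int) {
  def partitionOf(key: Any): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }
}

// Range partitioning needs boundary keys, and those depend on the data.
// Computing them in the constructor forces an evaluation right away.
final class ToyRangePartitioner[K](val numPartitions: Int, data: ToyDataset[K])(
    implicit ord: Ordering[K]) {

  // bounds(i - 1) is the smallest key that belongs to partition i.
  val bounds: Seq[K] = {
    val sorted = data.collect().sorted // the eager "job"
    if (sorted.isEmpty) Seq.empty
    else (1 until numPartitions).map(i => sorted(i * sorted.size / numPartitions))
  }

  // A key lands in the partition whose index equals the number of bounds it reaches.
  def partitionOf(key: K): Int = bounds.count(b => ord.gteq(key, b))
}

val data = new ToyDataset[Int](() => 0 to 5)

val hash = new ToyHashPartitioner(2)
println(data.jobsRun) // prints 0: nothing has been computed yet

val range = new ToyRangePartitioner(2, data)
println(data.jobsRun) // prints 1: building the partitioner evaluated the data
```

In this toy model the only way to avoid the eager evaluation is to defer computing the bounds until the first partitionOf call, which hints at why a fix in Spark itself reaches further than the "starter" label suggests.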