Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Jacek Laskowski
Hi Sparkians,

I use the latest Spark 1.6.0-SNAPSHOT in spark-shell with the default
local[*] master.

I created an RDD of pairs using the following snippet:

val rdd = sc.parallelize(0 to 5).map(n => (n, util.Random.nextBoolean))

It's all fine so far. The map transformation causes no computation.

I thought all transformations are lazy and trigger no job until an
action is called. It seems I was wrong about sortByKey()! When I called
`rdd.sortByKey()`, it started a job: sortByKey at <console>:27 (!)

Can anyone explain why sortByKey behaves differently, given that it is
a transformation and hence should be lazy? Is it a special
transformation?
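
For reference, here's roughly what the session looks like (output
abridged from memory, so the exact RDD ids and line numbers may differ
in your shell):

```
scala> val rdd = sc.parallelize(0 to 5).map(n => (n, util.Random.nextBoolean))
rdd: org.apache.spark.rdd.RDD[(Int, Boolean)] = MapPartitionsRDD[1] at map at <console>:24

// nothing has run yet: parallelize and map are lazy

scala> rdd.sortByKey()
// the logs/web UI immediately report a completed job: "sortByKey at <console>:27"
res0: org.apache.spark.rdd.RDD[(Int, Boolean)] = ShuffledRDD[4] at sortByKey at <console>:27
```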

Pozdrawiam,
Jacek

--
Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Jacek Laskowski
Hi,

Answering my own question after searching for sortByKey in the mailing
list archives and later in JIRA.

It turns out it's a known issue and filed under
https://issues.apache.org/jira/browse/SPARK-1021 "sortByKey() launches
a cluster job when it shouldn't".

It's labelled "starter", which suggests it should not be that hard to
fix. Does this still hold? I'd like to work on it if it's "simple" and
won't get me swamped. Thanks!

Pozdrawiam,
Jacek

--
Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski





Re: Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Mark Hamstra
Hah!  No, that is not a "starter" issue.  It touches on some fairly deep
Spark architecture, and there have already been a few attempts to resolve
the issue -- none entirely satisfactory, but you should definitely search
out the work that has already been done.
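
For anyone finding this thread later: the eager job comes from
RangePartitioner, which sortByKey uses to compute the key ranges of its
output partitions, and picking those boundaries requires sampling the
parent RDD's keys up front. A toy sketch of the idea in plain Scala
(this is NOT Spark's actual code; rangeBounds is a made-up name here):

```scala
// Toy illustration: choosing range-partition boundaries requires
// looking at actual key values, which is why a sampling job has to
// run before the sorted RDD's partitioning can even be defined.
def rangeBounds(sampledKeys: Seq[Int], numPartitions: Int): Seq[Int] = {
  val sorted = sampledKeys.sorted
  // Pick numPartitions - 1 boundary keys at evenly spaced quantiles.
  (1 until numPartitions).map(i => sorted(i * sorted.length / numPartitions))
}

// Keys <= 19 would go to partition 0, keys in (19, 42] to partition 1, etc.
val bounds = rangeBounds(Seq(42, 7, 19, 3, 88, 55, 23, 61), 4)
println(bounds)  // Vector(19, 42, 61)
```

None of the transformations before sortByKey needed to see the data,
which is why only this one breaks laziness.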
