Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Nick Pentreath
+1 for this proposal - as you mention, I think it's the de facto current
situation anyway.

Note that from a developer's view it's just the user-facing API that will be
"ml"-only - the majority of the actual algorithms still operate on RDDs
under the hood currently.
On Wed, 6 Apr 2016 at 05:03, Chris Fregly  wrote:

> perhaps renaming to Spark ML would actually clear up code and
> documentation confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin  wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley 
> wrote:
>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
>> wrote:
>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally
>>> move towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 Overall this sounds good to me. One question I have is that in
 addition to the ML algorithms we have a number of linear algebra
 (various distributed matrices) and statistical methods in the
 spark.mllib package. Is the plan to port or move these to the spark.ml
 namespace in the 2.x series ?

 Thanks
 Shivaram

 On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
 > FWIW, all of that sounds like a good plan to me. Developing one API is
 > certainly better than two.
 >
 > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng 
 wrote:
 >> Hi all,
 >>
 >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
 built
 >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
 API has
 >> been developed under the spark.ml package, while the old RDD-based
 API has
 >> been developed in parallel under the spark.mllib package. While it
 was
 >> easier to implement and experiment with new APIs under a new
 package, it
 >> became harder and harder to maintain as both packages grew bigger and
 >> bigger. And new users are often confused by having two sets of APIs
 with
 >> overlapped functions.
 >>
 >> We started to recommend the DataFrame-based API over the RDD-based
 API in
 >> Spark 1.5 for its versatility and flexibility, and we saw the
 development
 >> and the usage gradually shifting to the DataFrame-based API. Just
 counting
 >> the lines of Scala code, from 1.5 to the current master we added
 ~1
 >> lines to the DataFrame-based API while ~700 to the RDD-based API.
 So, to
 >> gather more resources on the development of the DataFrame-based API
 and to
 >> help users migrate over sooner, I want to propose switching
 RDD-based MLlib
 >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
 >>
 >> * We do not accept new features in the RDD-based spark.mllib
 package, unless
 >> they block implementing new features in the DataFrame-based spark.ml
 >> package.
 >> * We still accept bug fixes in the RDD-based API.
 >> * We will add more features to the DataFrame-based API in the 2.x
 series to
 >> reach feature parity with the RDD-based API.
 >> * Once we reach feature parity (possibly in Spark 2.2), we will
 deprecate
 >> the RDD-based API.
 >> * We will remove the RDD-based API from the main Spark repo in Spark
 3.0.
 >>
 >> Though the RDD-based API is already in de facto maintenance mode,
 this
 >> announcement will make it clear and hence important to both MLlib
 developers
 >> and users. So we’d greatly appreciate your feedback!
 >>
 >> (As a side note, people sometimes use “Spark ML” to refer to the
 >> DataFrame-based API or even the entire MLlib component. This also
 causes
 >> confusion. To be clear, “Spark ML” is not an official name and there
 are no
 >> plans to rename MLlib to “Spark ML” at this time.)
 >>
 >> Best,
 >> Xiangrui
 >
 > -
 > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 > For additional commands, e-mail: user-h...@spark.apache.org
 >

>>>
>>>
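
For readers skimming the thread, the practical difference between the two APIs
being discussed looks roughly like the sketch below (Spark 1.6-era code,
assuming a spark-shell session where sc and sqlContext are in scope; KMeans is
just one illustrative algorithm and is not part of the proposal text):

import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.clustering.{KMeans => MLKMeans}

// RDD-based API (spark.mllib), proposed for maintenance mode in 2.0:
val rdd = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
val oldModel = MLlibKMeans.train(rdd, 2, 20)  // k = 2, maxIterations = 20

// DataFrame-based API (spark.ml), where new development is focused:
val df = sqlContext.createDataFrame(rdd.map(Tuple1.apply)).toDF("features")
val newModel = new MLKMeans().setK(2).setMaxIter(20).fit(df)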

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Chris Fregly
perhaps renaming to Spark ML would actually clear up code and documentation 
confusion?

+1 for rename 

> On Apr 5, 2016, at 7:00 PM, Reynold Xin  wrote:
> 
> +1
> 
> This is a no brainer IMO.
> 
> 
>> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley  wrote:
>> +1  By the way, the JIRA for tracking (Scala) API parity is: 
>> https://issues.apache.org/jira/browse/SPARK-4591
>> 
>>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia  
>>> wrote:
>>> This sounds good to me as well. The one thing we should pay attention to is 
>>> how we update the docs so that people know to start with the spark.ml 
>>> classes. Right now the docs list spark.mllib first and also seem more 
>>> comprehensive in that area than in spark.ml, so maybe people naturally move 
>>> towards that.
>>> 
>>> Matei
>>> 
 On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
 
 Yes, DB (cc'ed) is working on porting the local linear algebra library 
 over (SPARK-13944). There are also frequent pattern mining algorithms we 
 need to port over in order to reach feature parity. -Xiangrui
 
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman 
>  wrote:
> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series ?
> 
> Thanks
> Shivaram
> 
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API 
> >> built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based 
> >> API has
> >> been developed under the spark.ml package, while the old RDD-based API 
> >> has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, 
> >> it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs 
> >> with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API 
> >> in
> >> Spark 1.5 for its versatility and flexibility, and we saw the 
> >> development
> >> and the usage gradually shifting to the DataFrame-based API. Just 
> >> counting
> >> the lines of Scala code, from 1.5 to the current master we added ~1
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, 
> >> to
> >> gather more resources on the development of the DataFrame-based API 
> >> and to
> >> help users migrate over sooner, I want to propose switching RDD-based 
> >> MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, 
> >> unless
> >> they block implementing new features in the DataFrame-based spark.ml
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x 
> >> series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will 
> >> deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 
> >> 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib 
> >> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also 
> >> causes
> >> confusion. To be clear, “Spark ML” is not an official name and there 
> >> are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> 


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Reynold Xin
+1

This is a no brainer IMO.


On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley 
wrote:

> +1  By the way, the JIRA for tracking (Scala) API parity is:
> https://issues.apache.org/jira/browse/SPARK-4591
>
> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
> wrote:
>
>> This sounds good to me as well. The one thing we should pay attention to
>> is how we update the docs so that people know to start with the spark.ml
>> classes. Right now the docs list spark.mllib first and also seem more
>> comprehensive in that area than in spark.ml, so maybe people naturally
>> move towards that.
>>
>> Matei
>>
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>>
>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>> need to port over in order to reach feature parity. -Xiangrui
>>
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series ?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>> > certainly better than two.
>>> >
>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng 
>>> wrote:
>>> >> Hi all,
>>> >>
>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>> built
>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>> API has
>>> >> been developed under the spark.ml package, while the old RDD-based
>>> API has
>>> >> been developed in parallel under the spark.mllib package. While it was
>>> >> easier to implement and experiment with new APIs under a new package,
>>> it
>>> >> became harder and harder to maintain as both packages grew bigger and
>>> >> bigger. And new users are often confused by having two sets of APIs
>>> with
>>> >> overlapped functions.
>>> >>
>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>> API in
>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>> development
>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>> counting
>>> >> the lines of Scala code, from 1.5 to the current master we added
>>> ~1
>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>> to
>>> >> gather more resources on the development of the DataFrame-based API
>>> and to
>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>> MLlib
>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>> >>
>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>> unless
>>> >> they block implementing new features in the DataFrame-based spark.ml
>>> >> package.
>>> >> * We still accept bug fixes in the RDD-based API.
>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>> series to
>>> >> reach feature parity with the RDD-based API.
>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>> deprecate
>>> >> the RDD-based API.
>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>> 3.0.
>>> >>
>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>> >> announcement will make it clear and hence important to both MLlib
>>> developers
>>> >> and users. So we’d greatly appreciate your feedback!
>>> >>
>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>> >> DataFrame-based API or even the entire MLlib component. This also
>>> causes
>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>> are no
>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>
>>
>


Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-05 Thread Kostas Sakellis
From both this and the JDK thread, I've noticed that people (myself included)
have different notions of compatibility guarantees between major and
minor versions.
A simple question I have is: What compatibility can we break between minor
vs. major releases?

It might be worth getting on the same page wrt compatibility guarantees.

Just a thought,
Kostas

On Tue, Apr 5, 2016 at 4:39 PM, Holden Karau  wrote:

> One minor downside to having both 2.10 and 2.11 (and eventually 2.12) is
> deprecation warnings in our builds that we can't fix without introducing a
> wrapper or Scala-version-specific code. This isn't a big deal, and if we drop
> 2.10 in the 3-6 month time frame talked about, we can clean up those warnings
> once we get there.
>
> On Fri, Apr 1, 2016 at 10:00 PM, Raymond Honderdors <
> raymond.honderd...@sizmek.com> wrote:
>
>> What about a separate branch for Scala 2.10?
>>
>>
>>
>> Sent from my Samsung Galaxy smartphone.
>>
>>
>>  Original message 
>> From: Koert Kuipers 
>> Date: 4/2/2016 02:10 (GMT+02:00)
>> To: Michael Armbrust 
>> Cc: Matei Zaharia , Mark Hamstra <
>> m...@clearstorydata.com>, Cody Koeninger , Sean Owen
>> , dev@spark.apache.org
>> Subject: Re: Discuss: commit to Scala 2.10 support for Spark 2.x
>> lifecycle
>>
>> as long as we don't lock ourselves into supporting scala 2.10 for the
>> entire spark 2 lifespan it sounds reasonable to me
>>
>> On Wed, Mar 30, 2016 at 3:25 PM, Michael Armbrust > > wrote:
>>
>>> +1 to Matei's reasoning.
>>>
>>> On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia 
>>> wrote:
>>>
 I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the
 entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's
 the default version we built with in 1.x. We want to make the transition
 from 1.x to 2.0 as easy as possible. In 2.0, we'll have the default
 downloads be for Scala 2.11, so people will more easily move, but we
 shouldn't create obstacles that lead to fragmenting the community and
 slowing down Spark 2.0's adoption. I've seen companies that stayed on an
 old Scala version for multiple years because switching it, or mixing
 versions, would affect the company's entire codebase.

 Matei

 On Mar 30, 2016, at 12:08 PM, Koert Kuipers  wrote:

 oh wow, had no idea it got ripped out

 On Wed, Mar 30, 2016 at 11:50 AM, Mark Hamstra  wrote:

> No, with 2.0 Spark really doesn't use Akka:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744
>
> On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers 
> wrote:
>
>> Spark still runs on akka. So if you want the benefits of the latest
>> akka (not saying we do, was just an example) then you need to drop scala
>> 2.10
>> On Mar 30, 2016 10:44 AM, "Cody Koeninger" 
>> wrote:
>>
>>> I agree with Mark in that I don't see how supporting scala 2.10 for
>>> spark 2.0 implies supporting it for all of spark 2.x
>>>
>>> Regarding Koert's comment on akka, I thought all akka dependencies
>>> have been removed from spark after SPARK-7997 and the recent removal
>>> of external/akka
>>>
>>> On Wed, Mar 30, 2016 at 9:36 AM, Mark Hamstra <
>>> m...@clearstorydata.com> wrote:
>>> > Dropping Scala 2.10 support has to happen at some point, so I'm not
>>> > fundamentally opposed to the idea; but I've got questions about
>>> how we go
>>> > about making the change and what degree of negative consequences
>>> we are
>>> > willing to accept.  Until now, we have been saying that 2.10
>>> support will be
>>> > continued in Spark 2.0.0.  Switching to 2.11 will be non-trivial
>>> for some
>>> > Spark users, so abruptly dropping 2.10 support is very likely to
>>> delay
>>> > migration to Spark 2.0 for those users.
>>> >
>>> > What about continuing 2.10 support in 2.0.x, but repeatedly making
>>> an
>>> > obvious announcement in multiple places that such support is
>>> deprecated,
>>> > that we are not committed to maintaining it throughout 2.x, and
>>> that it is,
>>> > in fact, scheduled to be removed in 2.1.0?
>>> >
>>> > On Wed, Mar 30, 2016 at 7:45 AM, Sean Owen 
>>> wrote:
>>> >>
>>> >> (This should fork as its own thread, though it began during
>>> discussion
>>> >> of whether to continue Java 7 support in Spark 2.x.)
>>> >>
>>> >> Simply: would like to more clearly take the temperature of all
>>> >> interested parties about whether to support Scala 2.10 in the
>>> Spark
>>> >> 2.x lifecycle. Some of 

Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Ted Yu
Probably related to Java 8.

I used :

$ java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

On Tue, Apr 5, 2016 at 6:32 PM, Jacek Laskowski  wrote:

> Hi Ted,
>
> This is a similar issue
> https://issues.apache.org/jira/browse/SPARK-12530. I've fixed today's
> one and am sending a pull req.
>
> My build command is as follows:
>
> ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.2 -Phive
> -Phive-thriftserver -DskipTests clean install
>
> I'm on Java 8 / Mac OS X
>
> ➜  spark git:(master) ✗ java -version
> java version "1.8.0_77"
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Apr 5, 2016 at 8:41 PM, Ted Yu  wrote:
> > Looking at recent
> >
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7
> > builds, there was no such error.
> > I don't see anything wrong with the code:
> >
> >   usage = "_FUNC_(str) - " +
> > "Returns str, with the first letter of each word in uppercase, all
> other
> > letters in " +
> >
> > Mind refresh and build again ?
> >
> > If it still fails, please share the build command.
> >
> > On Tue, Apr 5, 2016 at 4:51 PM, Jacek Laskowski  wrote:
> >>
> >> Hi,
> >>
> >> Just checked out the latest sources and got this...
> >>
> >>
> >>
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626:
> >> error: annotation argument needs to be a constant; found: "_FUNC_(str)
> >> - ".+("Returns str, with the first letter of each word in uppercase,
> >> all other letters in ").+("lowercase. Words are delimited by white
> >> space.")
> >> "Returns str, with the first letter of each word in uppercase, all
> >> other letters in " +
> >>
> >>^
> >>
> >> It's in
> >>
> https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be
> .
> >>
> >> Is this a real issue with the build now or is this just me? I may have
> >> seen a similar case before, but can't remember what the fix was.
> >> Looking into it.
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >
>


Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Jacek Laskowski
Hi Ted,

This is a similar issue
https://issues.apache.org/jira/browse/SPARK-12530. I've fixed today's
one and am sending a pull req.

My build command is as follows:

./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.2 -Phive
-Phive-thriftserver -DskipTests clean install

I'm on Java 8 / Mac OS X

➜  spark git:(master) ✗ java -version
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Apr 5, 2016 at 8:41 PM, Ted Yu  wrote:
> Looking at recent
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7
> builds, there was no such error.
> I don't see anything wrong with the code:
>
>   usage = "_FUNC_(str) - " +
> "Returns str, with the first letter of each word in uppercase, all other
> letters in " +
>
> Mind refresh and build again ?
>
> If it still fails, please share the build command.
>
> On Tue, Apr 5, 2016 at 4:51 PM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Just checked out the latest sources and got this...
>>
>>
>> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626:
>> error: annotation argument needs to be a constant; found: "_FUNC_(str)
>> - ".+("Returns str, with the first letter of each word in uppercase,
>> all other letters in ").+("lowercase. Words are delimited by white
>> space.")
>> "Returns str, with the first letter of each word in uppercase, all
>> other letters in " +
>>
>>^
>>
>> It's in
>> https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be.
>>
>> Is this a real issue with the build now or is this just me? I may have
>> seen a similar case before, but can't remember what the fix was.
>> Looking into it.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
I finally figured out the problem: it seems that my *export
JAVA_HOME=/path/to/java8/home* was somehow not affecting the javac
executable that Zinc's SBT incremental compiler uses when it forks out to
javac to handle Java source files. As a result, we were passing a -source
1.8 flag to the platform's default javac, which happens to be Java 7.

To fix this, I'm going to modify the build to just prepend $JAVA_HOME/bin
to $PATH while setting up the test environment

On Tue, Apr 5, 2016 at 5:09 PM Josh Rosen  wrote:

> I've reverted the bulk of the conf changes while I investigate. I think
> that Zinc might be handling JAVA_HOME in a weird way and am SSH'ing to
> Jenkins to try to reproduce the problem in isolation.
>
> On Tue, Apr 5, 2016 at 4:14 PM Ted Yu  wrote:
>
>> Josh:
>> You may have noticed the following error (
>> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console
>> ):
>>
>> [error] javac: invalid source release: 1.8
>> [error] Usage: javac  
>> [error] use -help for a list of possible options
>>
>>
>> On Tue, Apr 5, 2016 at 2:14 PM, Josh Rosen 
>> wrote:
>>
>>> In order to be able to run Java 8 API compatibility tests, I'm going to
>>> push a new set of Jenkins configurations for Spark's test and PR builders
>>> so that those jobs use a Java 8 JDK. I tried this once in the past and it
>>> seemed to introduce some rare, transient flakiness in certain tests, so if
>>> anyone observes new test failures please email me and I'll investigate
>>> right away.
>>>
>>> Note that this change has no impact on Spark's supported JDK versions
>>> and our build will still target Java 7 and emit Java 7 bytecode; the
>>> purpose of this change is simply to allow the Java 8 lambda tests to be run
>>> as part of PR builder runs.
>>>
>>> - Josh
>>>
>>
>>


Re: BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Ted Yu
Looking at recent
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7
builds, there was no such error.
I don't see anything wrong with the code:

  usage = "_FUNC_(str) - " +
"Returns str, with the first letter of each word in uppercase, all
other letters in " +

Mind refresh and build again ?

If it still fails, please share the build command.

On Tue, Apr 5, 2016 at 4:51 PM, Jacek Laskowski  wrote:

> Hi,
>
> Just checked out the latest sources and got this...
>
>
> /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626:
> error: annotation argument needs to be a constant; found: "_FUNC_(str)
> - ".+("Returns str, with the first letter of each word in uppercase,
> all other letters in ").+("lowercase. Words are delimited by white
> space.")
> "Returns str, with the first letter of each word in uppercase, all
> other letters in " +
>
>^
>
> It's in
> https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be
> .
>
> Is this a real issue with the build now or is this just me? I may have
> seen a similar case before, but can't remember what the fix was.
> Looking into it.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Holden Karau
I'm very much in favor of this, the less porting work there is the better :)

On Tue, Apr 5, 2016 at 5:32 PM, Joseph Bradley 
wrote:

> +1  By the way, the JIRA for tracking (Scala) API parity is:
> https://issues.apache.org/jira/browse/SPARK-4591
>
> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
> wrote:
>
>> This sounds good to me as well. The one thing we should pay attention to
>> is how we update the docs so that people know to start with the spark.ml
>> classes. Right now the docs list spark.mllib first and also seem more
>> comprehensive in that area than in spark.ml, so maybe people naturally
>> move towards that.
>>
>> Matei
>>
>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>>
>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>> need to port over in order to reach feature parity. -Xiangrui
>>
>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Overall this sounds good to me. One question I have is that in
>>> addition to the ML algorithms we have a number of linear algebra
>>> (various distributed matrices) and statistical methods in the
>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>> namespace in the 2.x series ?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>> > certainly better than two.
>>> >
>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng 
>>> wrote:
>>> >> Hi all,
>>> >>
>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>> built
>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>> API has
>>> >> been developed under the spark.ml package, while the old RDD-based
>>> API has
>>> >> been developed in parallel under the spark.mllib package. While it was
>>> >> easier to implement and experiment with new APIs under a new package,
>>> it
>>> >> became harder and harder to maintain as both packages grew bigger and
>>> >> bigger. And new users are often confused by having two sets of APIs
>>> with
>>> >> overlapped functions.
>>> >>
>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>> API in
>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>> development
>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>> counting
>>> >> the lines of Scala code, from 1.5 to the current master we added
>>> ~1
>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>> to
>>> >> gather more resources on the development of the DataFrame-based API
>>> and to
>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>> MLlib
>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>> >>
>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>> unless
>>> >> they block implementing new features in the DataFrame-based spark.ml
>>> >> package.
>>> >> * We still accept bug fixes in the RDD-based API.
>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>> series to
>>> >> reach feature parity with the RDD-based API.
>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>> deprecate
>>> >> the RDD-based API.
>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>> 3.0.
>>> >>
>>> >> Though the RDD-based API is already in de facto maintenance mode, this
>>> >> announcement will make it clear and hence important to both MLlib
>>> developers
>>> >> and users. So we’d greatly appreciate your feedback!
>>> >>
>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>> >> DataFrame-based API or even the entire MLlib component. This also
>>> causes
>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>> are no
>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>> >>
>>> >> Best,
>>> >> Xiangrui
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1  By the way, the JIRA for tracking (Scala) API parity is:
https://issues.apache.org/jira/browse/SPARK-4591

On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
wrote:

> This sounds good to me as well. The one thing we should pay attention to
> is how we update the docs so that people know to start with the spark.ml
> classes. Right now the docs list spark.mllib first and also seem more
> comprehensive in that area than in spark.ml, so maybe people naturally
> move towards that.
>
> Matei
>
> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>
> Yes, DB (cc'ed) is working on porting the local linear algebra library
> over (SPARK-13944). There are also frequent pattern mining algorithms we
> need to port over in order to reach feature parity. -Xiangrui
>
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Overall this sounds good to me. One question I have is that in
>> addition to the ML algorithms we have a number of linear algebra
>> (various distributed matrices) and statistical methods in the
>> spark.mllib package. Is the plan to port or move these to the spark.ml
>> namespace in the 2.x series ?
>>
>> Thanks
>> Shivaram
>>
>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>> > certainly better than two.
>> >
>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
>> >> Hi all,
>> >>
>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>> built
>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>> API has
>> >> been developed under the spark.ml package, while the old RDD-based
>> API has
>> >> been developed in parallel under the spark.mllib package. While it was
>> >> easier to implement and experiment with new APIs under a new package,
>> it
>> >> became harder and harder to maintain as both packages grew bigger and
>> >> bigger. And new users are often confused by having two sets of APIs
>> with
>> >> overlapped functions.
>> >>
>> >> We started to recommend the DataFrame-based API over the RDD-based API
>> in
>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>> development
>> >> and the usage gradually shifting to the DataFrame-based API. Just
>> counting
>> >> the lines of Scala code, from 1.5 to the current master we added ~1
>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>> to
>> >> gather more resources on the development of the DataFrame-based API
>> and to
>> >> help users migrate over sooner, I want to propose switching RDD-based
>> MLlib
>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>> >>
>> >> * We do not accept new features in the RDD-based spark.mllib package,
>> unless
>> >> they block implementing new features in the DataFrame-based spark.ml
>> >> package.
>> >> * We still accept bug fixes in the RDD-based API.
>> >> * We will add more features to the DataFrame-based API in the 2.x
>> series to
>> >> reach feature parity with the RDD-based API.
>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>> deprecate
>> >> the RDD-based API.
>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>> 3.0.
>> >>
>> >> Though the RDD-based API is already in de facto maintenance mode, this
>> >> announcement will make it clear and hence important to both MLlib
>> developers
>> >> and users. So we’d greatly appreciate your feedback!
>> >>
>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>> >> DataFrame-based API or even the entire MLlib component. This also
>> causes
>> >> confusion. To be clear, “Spark ML” is not an official name and there
>> are no
>> >> plans to rename MLlib to “Spark ML” at this time.)
>> >>
>> >> Best,
>> >> Xiangrui
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>
>


Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
I've reverted the bulk of the conf changes while I investigate. I think
that Zinc might be handling JAVA_HOME in a weird way and am SSH'ing to
Jenkins to try to reproduce the problem in isolation.

On Tue, Apr 5, 2016 at 4:14 PM Ted Yu  wrote:

> Josh:
> You may have noticed the following error (
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console
> ):
>
> [error] javac: invalid source release: 1.8
> [error] Usage: javac  
> [error] use -help for a list of possible options
>
>
> On Tue, Apr 5, 2016 at 2:14 PM, Josh Rosen 
> wrote:
>
>> In order to be able to run Java 8 API compatibility tests, I'm going to
>> push a new set of Jenkins configurations for Spark's test and PR builders
>> so that those jobs use a Java 8 JDK. I tried this once in the past and it
>> seemed to introduce some rare, transient flakiness in certain tests, so if
>> anyone observes new test failures please email me and I'll investigate
>> right away.
>>
>> Note that this change has no impact on Spark's supported JDK versions and
>> our build will still target Java 7 and emit Java 7 bytecode; the purpose of
>> this change is simply to allow the Java 8 lambda tests to be run as part of
>> PR builder runs.
>>
>> - Josh
>>
>
>


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Matei Zaharia
This sounds good to me as well. The one thing we should pay attention to is how 
we update the docs so that people know to start with the spark.ml classes. 
Right now the docs list spark.mllib first and also seem more comprehensive in 
that area than in spark.ml, so maybe people naturally move towards that.

Matei

> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
> 
> Yes, DB (cc'ed) is working on porting the local linear algebra library over 
> (SPARK-13944). There are also frequent pattern mining algorithms we need to 
> port over in order to reach feature parity. -Xiangrui
> 
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman  wrote:
> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series ?
> 
> Thanks
> Shivaram
> 
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API 
> >> has
> >> been developed under the spark.ml  package, while the 
> >> old RDD-based API has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API in
> >> Spark 1.5 for its versatility and flexibility, and we saw the development
> >> and the usage gradually shifting to the DataFrame-based API. Just counting
> >> the lines of Scala code, from 1.5 to the current master we added ~1
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and to
> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, 
> >> unless
> >> they block implementing new features in the DataFrame-based spark.ml
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib 
> >> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >



BROKEN BUILD? Is this only me or not?

2016-04-05 Thread Jacek Laskowski
Hi,

Just checked out the latest sources and got this...

/Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626:
error: annotation argument needs to be a constant; found: "_FUNC_(str)
- ".+("Returns str, with the first letter of each word in uppercase,
all other letters in ").+("lowercase. Words are delimited by white
space.")
"Returns str, with the first letter of each word in uppercase, all
other letters in " +

   ^

It's in 
https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be.

Is this a real issue with the build now or is this just me? I may have
seen a similar case before, but can't remember what the fix was.
Looking into it.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [STREAMING] DStreamClosureSuite.scala with { return; ssc.sparkContext.emptyRDD[Int] } Why?!

2016-04-05 Thread Jacek Laskowski
Hi Ted,

Yeah, I saw the line, but forgot it's a test that may have been
testing that closures should not have return. More clear now. Thanks!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Apr 5, 2016 at 6:47 PM, Ted Yu  wrote:
> The next line should give some clue:
> expectCorrectException { ssc.transform(Seq(ds), transformF) }
>
> Closure shouldn't include return.
>
> On Tue, Apr 5, 2016 at 3:40 PM, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> In
>> https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/DStreamClosureSuite.scala#L190:
>>
>> { return; ssc.sparkContext.emptyRDD[Int] }
>>
>> What is this return inside for? I don't understand the line and am
>> about to propose a change to remove it.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
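
A bit more background on why a return inside a closure is a problem at all: in
Scala, return inside a lambda is a non-local return that tries to exit the
enclosing method, which cannot work once the closure outlives that method - or,
in Spark's case, runs on another JVM entirely - which is why the suite expects
an exception for such closures. A minimal plain-Scala sketch of the failure
mode (no Spark required; all names here are illustrative):

object NonLocalReturnDemo {
  var stored: () => Unit = _

  def register(): Unit = {
    // The `return` below targets register(), not the lambda itself, so it
    // only works while register() is still on the call stack.
    stored = () => { println("about to 'return'"); return }
  }

  def main(args: Array[String]): Unit = {
    register()      // register() has completed and left the stack
    try stored()    // the deferred non-local return now has no target...
    catch {
      case _: scala.runtime.NonLocalReturnControl[_] =>
        println("caught NonLocalReturnControl: the enclosing method is gone")
    }
  }
}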



Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-05 Thread Holden Karau
One minor downside to having both 2.10 and 2.11 (and eventually 2.12) is
deprecation warnings in our builds that we can't fix without introducing a
wrapper or Scala-version-specific code. This isn't a big deal, and if we drop
2.10 in the 3-6 month time frame talked about, we can clean up those warnings
once we get there.

On Fri, Apr 1, 2016 at 10:00 PM, Raymond Honderdors <
raymond.honderd...@sizmek.com> wrote:

> What about a separate branch for Scala 2.10?
>
>
>
> Sent from my Samsung Galaxy smartphone.
>
>
>  Original message 
> From: Koert Kuipers 
> Date: 4/2/2016 02:10 (GMT+02:00)
> To: Michael Armbrust 
> Cc: Matei Zaharia , Mark Hamstra <
> m...@clearstorydata.com>, Cody Koeninger , Sean Owen <
> so...@cloudera.com>, dev@spark.apache.org
> Subject: Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle
>
> as long as we don't lock ourselves into supporting scala 2.10 for the
> entire spark 2 lifespan it sounds reasonable to me
>
> On Wed, Mar 30, 2016 at 3:25 PM, Michael Armbrust 
> wrote:
>
>> +1 to Matei's reasoning.
>>
>> On Wed, Mar 30, 2016 at 9:21 AM, Matei Zaharia 
>> wrote:
>>
>>> I agree that putting it in 2.0 doesn't mean keeping Scala 2.10 for the
>>> entire 2.x line. My vote is to keep Scala 2.10 in Spark 2.0, because it's
>>> the default version we built with in 1.x. We want to make the transition
>>> from 1.x to 2.0 as easy as possible. In 2.0, we'll have the default
>>> downloads be for Scala 2.11, so people will more easily move, but we
>>> shouldn't create obstacles that lead to fragmenting the community and
>>> slowing down Spark 2.0's adoption. I've seen companies that stayed on an
>>> old Scala version for multiple years because switching it, or mixing
>>> versions, would affect the company's entire codebase.
>>>
>>> Matei
>>>
>>> On Mar 30, 2016, at 12:08 PM, Koert Kuipers  wrote:
>>>
>>> oh wow, had no idea it got ripped out
>>>
>>> On Wed, Mar 30, 2016 at 11:50 AM, Mark Hamstra 
>>> wrote:
>>>
 No, with 2.0 Spark really doesn't use Akka:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L744

 On Wed, Mar 30, 2016 at 9:10 AM, Koert Kuipers 
 wrote:

> Spark still runs on akka. So if you want the benefits of the latest
> akka (not saying we do, was just an example) then you need to drop scala
> 2.10
> On Mar 30, 2016 10:44 AM, "Cody Koeninger"  wrote:
>
>> I agree with Mark in that I don't see how supporting scala 2.10 for
>> spark 2.0 implies supporting it for all of spark 2.x
>>
>> Regarding Koert's comment on akka, I thought all akka dependencies
>> have been removed from spark after SPARK-7997 and the recent removal
>> of external/akka
>>
>> On Wed, Mar 30, 2016 at 9:36 AM, Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>> > Dropping Scala 2.10 support has to happen at some point, so I'm not
>> > fundamentally opposed to the idea; but I've got questions about how
>> we go
>> > about making the change and what degree of negative consequences we
>> are
>> > willing to accept.  Until now, we have been saying that 2.10
>> support will be
>> > continued in Spark 2.0.0.  Switching to 2.11 will be non-trivial
>> for some
>> > Spark users, so abruptly dropping 2.10 support is very likely to
>> delay
>> > migration to Spark 2.0 for those users.
>> >
>> > What about continuing 2.10 support in 2.0.x, but repeatedly making
>> an
>> > obvious announcement in multiple places that such support is
>> deprecated,
>> > that we are not committed to maintaining it throughout 2.x, and
>> that it is,
>> > in fact, scheduled to be removed in 2.1.0?
>> >
>> > On Wed, Mar 30, 2016 at 7:45 AM, Sean Owen 
>> wrote:
>> >>
>> >> (This should fork as its own thread, though it began during
>> discussion
>> >> of whether to continue Java 7 support in Spark 2.x.)
>> >>
>> >> Simply: would like to more clearly take the temperature of all
>> >> interested parties about whether to support Scala 2.10 in the Spark
>> >> 2.x lifecycle. Some of the arguments appear to be:
>> >>
>> >> Pro
>> >> - Some third party dependencies do not support Scala 2.11+ yet and
>> so
>> >> would not be usable in a Spark app
>> >>
>> >> Con
>> >> - Lower maintenance overhead -- no separate 2.10 build,
>> >> cross-building, tests to check, esp considering support of 2.12
>> will
>> >> be needed
>> >> - Can use 2.11+ features freely
>> >> - 2.10 was EOL in late 2014 and Spark 2.x lifecycle is years to
>> come
>> >>
>> >> I would like to not support 2.10 for 

Re: Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Ted Yu
Josh:
You may have noticed the following error (
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/566/console
):

[error] javac: invalid source release: 1.8
[error] Usage: javac  
[error] use -help for a list of possible options


On Tue, Apr 5, 2016 at 2:14 PM, Josh Rosen  wrote:

> In order to be able to run Java 8 API compatibility tests, I'm going to
> push a new set of Jenkins configurations for Spark's test and PR builders
> so that those jobs use a Java 8 JDK. I tried this once in the past and it
> seemed to introduce some rare, transient flakiness in certain tests, so if
> anyone observes new test failures please email me and I'll investigate
> right away.
>
> Note that this change has no impact on Spark's supported JDK versions and
> our build will still target Java 7 and emit Java 7 bytecode; the purpose of
> this change is simply to allow the Java 8 lambda tests to be run as part of
> PR builder runs.
>
> - Josh
>


Re: [STREAMING] DStreamClosureSuite.scala with { return; ssc.sparkContext.emptyRDD[Int] } Why?!

2016-04-05 Thread Ted Yu
The next line should give some clue:
expectCorrectException { ssc.transform(Seq(ds), transformF) }

Closure shouldn't include return.

On Tue, Apr 5, 2016 at 3:40 PM, Jacek Laskowski  wrote:

> Hi,
>
> In
> https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/DStreamClosureSuite.scala#L190
> :
>
> { return; ssc.sparkContext.emptyRDD[Int] }
>
> What is this return inside for? I don't understand the line and am
> about to propose a change to remove it.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


[STREAMING] DStreamClosureSuite.scala with { return; ssc.sparkContext.emptyRDD[Int] } Why?!

2016-04-05 Thread Jacek Laskowski
Hi,

In 
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/DStreamClosureSuite.scala#L190:

{ return; ssc.sparkContext.emptyRDD[Int] }

What is this return inside for? I don't understand the line and am
about to propose a change to remove it.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Updating Spark PR builder and 2.x test jobs to use Java 8 JDK

2016-04-05 Thread Josh Rosen
In order to be able to run Java 8 API compatibility tests, I'm going to
push a new set of Jenkins configurations for Spark's test and PR builders
so that those jobs use a Java 8 JDK. I tried this once in the past and it
seemed to introduce some rare, transient flakiness in certain tests, so if
anyone observes new test failures please email me and I'll investigate
right away.

Note that this change has no impact on Spark's supported JDK versions and
our build will still target Java 7 and emit Java 7 bytecode; the purpose of
this change is simply to allow the Java 8 lambda tests to be run as part of
PR builder runs.

- Josh


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
Yes, DB (cc'ed) is working on porting the local linear algebra library over
(SPARK-13944). There are also frequent pattern mining algorithms we need to
port over in order to reach feature parity. -Xiangrui

On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml
> namespace in the 2.x series ?
>
> Thanks
> Shivaram
>
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
> built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
> API has
> >> been developed under the spark.ml package, while the old RDD-based API
> has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API
> in
> >> Spark 1.5 for its versatility and flexibility, and we saw the
> development
> >> and the usage gradually shifting to the DataFrame-based API. Just
> counting
> >> the lines of Scala code, from 1.5 to the current master we added ~1
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and
> to
> >> help users migrate over sooner, I want to propose switching RDD-based
> MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package,
> unless
> >> they block implementing new features in the DataFrame-based spark.ml
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x
> series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will
> deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark
> 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib
> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there
> are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Shivaram Venkataraman
Overall this sounds good to me. One question I have is that in
addition to the ML algorithms we have a number of linear algebra
(various distributed matrices) and statistical methods in the
spark.mllib package. Is the plan to port or move these to the spark.ml
namespace in the 2.x series ?

Thanks
Shivaram

On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
> FWIW, all of that sounds like a good plan to me. Developing one API is
> certainly better than two.
>
> On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
>> Hi all,
>>
>> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
>> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
>> been developed under the spark.ml package, while the old RDD-based API has
>> been developed in parallel under the spark.mllib package. While it was
>> easier to implement and experiment with new APIs under a new package, it
>> became harder and harder to maintain as both packages grew bigger and
>> bigger. And new users are often confused by having two sets of APIs with
>> overlapped functions.
>>
>> We started to recommend the DataFrame-based API over the RDD-based API in
>> Spark 1.5 for its versatility and flexibility, and we saw the development
>> and the usage gradually shifting to the DataFrame-based API. Just counting
>> the lines of Scala code, from 1.5 to the current master we added ~1
>> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
>> gather more resources on the development of the DataFrame-based API and to
>> help users migrate over sooner, I want to propose switching RDD-based MLlib
>> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>
>> * We do not accept new features in the RDD-based spark.mllib package, unless
>> they block implementing new features in the DataFrame-based spark.ml
>> package.
>> * We still accept bug fixes in the RDD-based API.
>> * We will add more features to the DataFrame-based API in the 2.x series to
>> reach feature parity with the RDD-based API.
>> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
>> the RDD-based API.
>> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>>
>> Though the RDD-based API is already in de facto maintenance mode, this
>> announcement will make it clear and hence important to both MLlib developers
>> and users. So we’d greatly appreciate your feedback!
>>
>> (As a side note, people sometimes use “Spark ML” to refer to the
>> DataFrame-based API or even the entire MLlib component. This also causes
>> confusion. To be clear, “Spark ML” is not an official name and there are no
>> plans to rename MLlib to “Spark ML” at this time.)
>>
>> Best,
>> Xiangrui
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Sean Owen
FWIW, all of that sounds like a good plan to me. Developing one API is
certainly better than two.

On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
> Hi all,
>
> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has
> been developed under the spark.ml package, while the old RDD-based API has
> been developed in parallel under the spark.mllib package. While it was
> easier to implement and experiment with new APIs under a new package, it
> became harder and harder to maintain as both packages grew bigger and
> bigger. And new users are often confused by having two sets of APIs with
> overlapped functions.
>
> We started to recommend the DataFrame-based API over the RDD-based API in
> Spark 1.5 for its versatility and flexibility, and we saw the development
> and the usage gradually shifting to the DataFrame-based API. Just counting
> the lines of Scala code, from 1.5 to the current master we added ~1
> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> gather more resources on the development of the DataFrame-based API and to
> help users migrate over sooner, I want to propose switching RDD-based MLlib
> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>
> * We do not accept new features in the RDD-based spark.mllib package, unless
> they block implementing new features in the DataFrame-based spark.ml
> package.
> * We still accept bug fixes in the RDD-based API.
> * We will add more features to the DataFrame-based API in the 2.x series to
> reach feature parity with the RDD-based API.
> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> the RDD-based API.
> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
>
> Though the RDD-based API is already in de facto maintenance mode, this
> announcement will make it clear and hence important to both MLlib developers
> and users. So we’d greatly appreciate your feedback!
>
> (As a side note, people sometimes use “Spark ML” to refer to the
> DataFrame-based API or even the entire MLlib component. This also causes
> confusion. To be clear, “Spark ML” is not an official name and there are no
> plans to rename MLlib to “Spark ML” at this time.)
>
> Best,
> Xiangrui

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark Streaming UI reporting a different task duration

2016-04-05 Thread Renyi Xiong
Hi TD,

We noticed that the Spark Streaming UI reports a different task duration
from time to time.

e.g., here's the standard output of the application, which reports that the
duration of the longest task is about 3.3 minutes:

16/04/01 16:07:19 INFO TaskSetManager: Finished task 1077.0 in stage 0.0
(TID 1077) in 125425 ms on CH1SCH080051460 (1562/1563)

16/04/01 16:08:30 INFO TaskSetManager: Finished task 926.0 in stage 0.0
(TID 926) in 196776 ms on CH1SCH080100841 (1563/1563)


but on the Spark Streaming UI it's about 2.3 minutes




*Summary Metrics for 1563 Completed Tasks*

Metric                          Min         25th percentile   Median            75th percentile   Max
Duration                        12 s        21 s              24 s              29 s              2.3 min
GC Time                         0 ms        0 ms              0 ms              0 ms              62 ms
Shuffle Write Size / Records    0.0 B / 0   532.9 KB / 131    1198.9 KB / 299   3.8 MB / 1161     9.0 MB / 2865

I wonder if you have any quick idea about where the missing 1 minute could
be?


thanks a lot,

Renyi.
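
For reference, converting the two logged task times quoted above into minutes makes the gap concrete. A quick sketch (using only the numbers from the stderr lines and the UI summary table above; the object name is illustrative):

object DurationGap {
  def main(args: Array[String]): Unit = {
    // Durations taken from the two TaskSetManager lines quoted above.
    val loggedMs = Seq(125425L, 196776L)
    val uiMaxMin = 2.3 // "Max" duration reported in the UI summary table
    loggedMs.foreach(ms => println(f"logged: $ms ms = ${ms / 60000.0}%.2f min"))
    // 196776 ms is about 3.28 min, so the UI max of 2.3 min is short by roughly a minute.
    println(f"gap vs UI max: ${196776L / 60000.0 - uiMaxMin}%.2f min")
  }
}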


Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
Hi all,

More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API
has been developed under the spark.ml package, while the old RDD-based API
has been developed in parallel under the spark.mllib package. While it was
easier to implement and experiment with new APIs under a new package, it
became harder and harder to maintain as both packages grew bigger and
bigger. And new users are often confused by having two sets of APIs with
overlapped functions.

We started to recommend the DataFrame-based API over the RDD-based API in
Spark 1.5 for its versatility and flexibility, and we saw the development
and the usage gradually shifting to the DataFrame-based API. Just counting
the lines of Scala code, from 1.5 to the current master we added ~1
lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
gather more resources on the development of the DataFrame-based API and to
help users migrate over sooner, I want to propose switching RDD-based MLlib
APIs to maintenance mode in Spark 2.0. What does it mean exactly?

* We do not accept new features in the RDD-based spark.mllib package,
unless they block implementing new features in the DataFrame-based spark.ml
package.
* We still accept bug fixes in the RDD-based API.
* We will add more features to the DataFrame-based API in the 2.x series to
reach feature parity with the RDD-based API.
* Once we reach feature parity (possibly in Spark 2.2), we will deprecate
the RDD-based API.
* We will remove the RDD-based API from the main Spark repo in Spark 3.0.

Though the RDD-based API is already in de facto maintenance mode, this
announcement will make it clear and hence important to both MLlib
developers and users. So we’d greatly appreciate your feedback!

(As a side note, people sometimes use “Spark ML” to refer to the
DataFrame-based API or even the entire MLlib component. This also causes
confusion. To be clear, “Spark ML” is not an official name and there are no
plans to rename MLlib to “Spark ML” at this time.)

Best,
Xiangrui
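
For readers weighing the two packages named in the proposal, here is a minimal sketch, assuming Spark 2.0-style APIs (SparkSession, org.apache.spark.ml.linalg); the toy data, column names, and object name are illustrative and not part of the proposal. It fits the same logistic regression once against the RDD-based API and once against the DataFrame-based API:

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vectors => NewVectors}

object MllibVsMl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-vs-ml").master("local[*]").getOrCreate()

    // RDD-based API (spark.mllib): the package proposed for maintenance mode.
    val rdd = spark.sparkContext.parallelize(Seq(
      LabeledPoint(1.0, OldVectors.dense(0.0, 1.1, 0.1)),
      LabeledPoint(0.0, OldVectors.dense(2.0, 1.0, -1.0))))
    val oldModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(rdd)

    // DataFrame-based API (spark.ml): the recommended path going forward.
    val df = spark.createDataFrame(Seq(
      (1.0, NewVectors.dense(0.0, 1.1, 0.1)),
      (0.0, NewVectors.dense(2.0, 1.0, -1.0)))).toDF("label", "features")
    val newModel = new LogisticRegression().setMaxIter(10).fit(df)

    println(s"spark.mllib weights: ${oldModel.weights}")
    println(s"spark.ml coefficients: ${newModel.coefficients}")
    spark.stop()
  }
}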


Re: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Raymond Honderdors
I did a check of that; I could not find it in any of the config files

I also used config files that work with 1.6.1

Sent from Outlook Mobile



On Tue, Apr 5, 2016 at 9:22 AM -0700, "Ted Yu" 
> wrote:

Raymond:

Did "namenode" appear in any of the Spark config files ?

BTW Scala 2.11 is used by the default build.

On Tue, Apr 5, 2016 at 6:22 AM, Raymond Honderdors 
> wrote:
I can see that the build is successful
(-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
-Dscala-2.11 -DskipTests clean package)

the documentation page still says that
"
Building With Hive and JDBC Support
To enable Hive integration for Spark SQL along with its JDBC server and CLI, 
add the -Phive and Phive-thriftserver profiles to your existing build options. 
By default Spark will build with Hive 0.13.1 bindings.

# Apache Hadoop 2.4.X with Hive 13 support
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver 
-DskipTests clean package
Building for Scala 2.11
To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11 
property:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
Spark does not yet support its JDBC component for Scala 2.11.
"
Source : http://spark.apache.org/docs/latest/building-spark.html

When I try to start the thrift server I get the following error:
"
16/04/05 16:09:11 INFO BlockManagerMaster: Registered BlockManager
16/04/05 16:09:12 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.UnknownHostException: namenode
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at 
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at 
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at 
org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1667)
at 
org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:67)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:517)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:57)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:77)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:726)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: namenode
... 26 more
16/04/05 16:09:12 INFO SparkUI: Stopped Spark web UI at 
http://10.10.182.195:4040
16/04/05 16:09:12 INFO SparkDeploySchedulerBackend: Shutting down all executors
"



Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, April 05, 2016 3:57 PM
To: Raymond Honderdors 
>
Cc: dev@spark.apache.org
Subject: Re: Build with Thrift Server & Scala 2.11

What do you mean? The Jenkins build for Spark uses 2.11 and also builds the 
thrift server.

On Tuesday, April 5, 2016, Raymond Honderdors 

Re: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Ted Yu
Raymond:

Did "namenode" appear in any of the Spark config files ?

BTW Scala 2.11 is used by the default build.

On Tue, Apr 5, 2016 at 6:22 AM, Raymond Honderdors <
raymond.honderd...@sizmek.com> wrote:

> I can see that the build is successful
>
> (-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
> –Dscala-2.11 -DskipTests clean package)
>
>
>
> the documentation page still says that
>
> “
>
> Building With Hive and JDBC Support
>
> To enable Hive integration for Spark SQL along with its JDBC server and
> CLI, add the -Phive and Phive-thriftserver profiles to your existing build
> options. By default Spark will build with Hive 0.13.1 bindings.
>
>
>
> # Apache Hadoop 2.4.X with Hive 13 support
>
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver
> -DskipTests clean package
>
> Building for Scala 2.11
>
> To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11
> property:
>
>
>
> ./dev/change-scala-version.sh 2.11
>
> mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
>
> Spark does not yet support its JDBC component for Scala 2.11.
>
> ”
>
> Source : http://spark.apache.org/docs/latest/building-spark.html
>
>
>
> When I try to start the thrift server I get the following error:
>
> “
>
> 16/04/05 16:09:11 INFO BlockManagerMaster: Registered BlockManager
>
> 16/04/05 16:09:12 ERROR SparkContext: Error initializing SparkContext.
>
> java.lang.IllegalArgumentException: java.net.UnknownHostException: namenode
>
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
>
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
>
> at
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
>
> at
> org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
>
> at
> org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
>
> at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
>
> at
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>
> at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>
> at
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>
> at
> org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1667)
>
> at
> org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:67)
>
> at
> org.apache.spark.SparkContext.<init>(SparkContext.scala:517)
>
> at
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:57)
>
> at
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:77)
>
> at
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:726)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
>
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
>
> at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: java.net.UnknownHostException: namenode
>
> ... 26 more
>
> 16/04/05 16:09:12 INFO SparkUI: Stopped Spark web UI at
> http://10.10.182.195:4040
>
> 16/04/05 16:09:12 INFO SparkDeploySchedulerBackend: Shutting down all
> executors
>
> ”
>
>
>
>
>
>
>
> *Raymond Honderdors *
>
> *Team Lead Analytics BI*
>
> *Business Intelligence Developer *
>
> *raymond.honderd...@sizmek.com  *
>
> *T +972.7325.3569*
>
> *Herzliya*
>
>
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Tuesday, April 05, 2016 3:57 PM
> *To:* Raymond Honderdors 
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Build with Thrift Server & Scala 2.11
>
>
>
> What do you mean? The Jenkins build for Spark uses 2.11 and also builds
> the thrift server.
>
> On Tuesday, April 5, 2016, Raymond Honderdors <
> raymond.honderd...@sizmek.com> wrote:
>
> Is anyone looking into this one, Build with Thrift Server & 

Re: RDD Partitions not distributed evenly to executors

2016-04-05 Thread Khaled Ammar
I have a similar experience.

Using 32 machines, I can see that the number of tasks (partitions) assigned to
executors (machines) is not even. Moreover, the distribution changes every
stage (iteration).

I wonder why Spark needs to move partitions around anyway; shouldn't the
scheduler reduce network (and other I/O) overhead by reducing such
relocation?

Thanks,
-Khaled




On Mon, Apr 4, 2016 at 10:57 PM, Koert Kuipers  wrote:

> can you try:
> spark.shuffle.reduceLocality.enabled=false
>
> On Mon, Apr 4, 2016 at 8:17 PM, Mike Hynes <91m...@gmail.com> wrote:
>
>> Dear all,
>>
>> Thank you for your responses.
>>
>> Michael Slavitch:
>> > Just to be sure:  Has spark-env.sh and spark-defaults.conf been
>> correctly propagated to all nodes?  Are they identical?
>> Yes; these files are stored on a shared memory directory accessible to
>> all nodes.
>>
>> Koert Kuipers:
>> > we ran into similar issues and it seems related to the new memory
>> > management. can you try:
>> > spark.memory.useLegacyMode = true
>> I reran the exact same code with a restarted cluster using this
>> modification, and did not observe any difference. The partitioning is
>> still imbalanced.
>>
>> Ted Yu:
>> > If the changes can be ported over to 1.6.1, do you mind reproducing the
>> issue there ?
>> Since the spark.memory.useLegacyMode setting did not impact my code
>> execution, I will have to change the Spark dependency back to earlier
>> versions to see if the issue persists and get back to you.
>>
>> Meanwhile, if anyone else has any other ideas or experience, please let
>> me know.
>>
>> Mike
>>
>> On 4/4/16, Koert Kuipers  wrote:
>> > we ran into similar issues and it seems related to the new memory
>> > management. can you try:
>> > spark.memory.useLegacyMode = true
>> >
>> > On Mon, Apr 4, 2016 at 9:12 AM, Mike Hynes <91m...@gmail.com> wrote:
>> >
>> >> [ CC'ing dev list since nearly identical questions have occurred in
>> >> user list recently w/o resolution;
>> >> c.f.:
>> >>
>> >>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html
>> >>
>> >>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html
>> >> ]
>> >>
>> >> Hello,
>> >>
>> >> In short, I'm reporting a problem concerning load imbalance of RDD
>> >> partitions across a standalone cluster. Though there are 16 cores
>> >> available per node, certain nodes will have >16 partitions, and some
>> >> will correspondingly have <16 (and even 0).
>> >>
>> >> In more detail: I am running some scalability/performance tests for
>> >> vector-type operations. The RDDs I'm considering are simple block
>> >> vectors of type RDD[(Int,Vector)] for a Breeze vector type. The RDDs
>> >> are generated with a fixed number of elements given by some multiple
>> >> of the available cores, and subsequently hash-partitioned by their
>> >> integer block index.
>> >>
>> >> I have verified that the hash partitioning key distribution, as well
>> >> as the keys themselves, are both correct; the problem is truly that
>> >> the partitions are *not* evenly distributed across the nodes.
>> >>
>> >> For instance, here is a representative output for some stages and
>> >> tasks in an iterative program. This is a very simple test with 2
>> >> nodes, 64 partitions, 32 cores (16 per node), and 2 executors. Two
>> >> example stages from the stderr log are stages 7 and 9:
>> >> 7,mapPartitions at DummyVector.scala:113,64,1459771364404,1459771365272
>> >> 9,mapPartitions at DummyVector.scala:113,64,1459771364431,1459771365639
>> >>
>> >> When counting the location of the partitions on the compute nodes from
>> >> the stderr logs, however, you can clearly see the imbalance. Example
>> >> lines are:
>> >> 13627 task 0.0 in stage 7.0 (TID 196,
>> >> himrod-2, partition 0,PROCESS_LOCAL, 3987 bytes)&
>> >> 13628 task 1.0 in stage 7.0 (TID 197,
>> >> himrod-2, partition 1,PROCESS_LOCAL, 3987 bytes)&
>> >> 13629 task 2.0 in stage 7.0 (TID 198,
>> >> himrod-2, partition 2,PROCESS_LOCAL, 3987 bytes)&
>> >>
>> >> Grep'ing the full set of above lines for each hostname, himrod-?,
>> >> shows the problem occurs in each stage. Below is the output, where the
>> >> number of partitions stored on each node is given alongside its
>> >> hostname as in (himrod-?,num_partitions):
>> >> Stage 7: (himrod-1,0) (himrod-2,64)
>> >> Stage 9: (himrod-1,16) (himrod-2,48)
>> >> Stage 12: (himrod-1,0) (himrod-2,64)
>> >> Stage 14: (himrod-1,16) (himrod-2,48)
>> >> The imbalance is also visible when the executor ID is used to count
>> >> the partitions operated on by executors.
>> >>
>> >> I am working off a fairly recent modification of 2.0.0-SNAPSHOT branch
>> >> (but the modifications do not touch the scheduler, and are irrelevant
>> >> for these particular tests). Has something changed radically in 1.6+
>> >> that would make a previously (<=1.5) correct configuration go haywire?
>> >> Have new 
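
A minimal sketch of the kind of block-vector RDD described above, with the two settings suggested in this thread applied explicitly (as workarounds to try, not as fixes). A plain Array[Double] stands in for the Breeze-backed vector type, and the object name and sizes are illustrative; the final step counts how many partitions each executor host computes:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object BlockVectorBalance {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("block-vector-balance")
      // Settings suggested earlier in this thread.
      .set("spark.memory.useLegacyMode", "true")
      .set("spark.shuffle.reduceLocality.enabled", "false")
    val sc = new SparkContext(conf) // master supplied via spark-submit

    val numPartitions = 64
    val blockSize = 1000
    // RDD[(Int, Array[Double])] hash-partitioned by integer block index.
    val blocks = sc.parallelize(0 until numPartitions, numPartitions)
      .map(i => (i, Array.fill(blockSize)(i.toDouble)))
      .partitionBy(new HashPartitioner(numPartitions))

    // Count how many partitions each executor host computes in this stage.
    val perHost = blocks
      .mapPartitions(_ => Iterator(java.net.InetAddress.getLocalHost.getHostName -> 1))
      .reduceByKey(_ + _)
      .collect()
    perHost.sortBy(_._1).foreach { case (host, n) => println(s"$host: $n partitions") }
    sc.stop()
  }
}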

Re: What influences the space complexity of Spark operations?

2016-04-05 Thread Steve Johnston
Submitted: SPARK-14389 - OOM during BroadcastNestedLoopJoin.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-influences-the-space-complexity-of-Spark-operations-tp16944p17029.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
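
For anyone following SPARK-14389, here is a minimal sketch (not the reporter's actual query; data, names, and sizes are illustrative) of a query shape that Spark typically plans as a BroadcastNestedLoopJoin: a join with no equi-join keys where one side is broadcast. If the broadcast side turns out to be large, it is materialized on every executor, which is one way the OOM described here can arise.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BnljSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bnlj-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val events = (1 to 1000000).toDF("ts")                     // hypothetical large side
    val windows = Seq((10, 20), (500, 900)).toDF("lo", "hi")   // hypothetical small side

    // No equi-join keys, so sort-merge/hash joins do not apply; broadcasting the
    // small side leads Spark to plan a BroadcastNestedLoopJoin.
    val joined = events.join(broadcast(windows), $"ts" >= $"lo" && $"ts" <= $"hi")
    joined.explain() // the physical plan should show BroadcastNestedLoopJoin
    println(joined.count())
    spark.stop()
  }
}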



Re: [discuss] ending support for Java 7 in Spark 2.0

2016-04-05 Thread Reynold Xin
Hi Sean,

See http://www.oracle.com/technetwork/java/eol-135779.html

Java 7 hasn't EOLed yet. If you look at support you can get from Oracle,
it actually goes to 2019. And you can even get more support after that.

Spark has always maintained great backward compatibility with other
systems, way beyond what vendors typically support. For example, we
supported Hadoop 1.x all the way until Spark 1.6 (basically the last
release), while all the vendors have dropped support for them already.

Putting my Databricks hat on we actually only support Java 8, but I think
it would be great to still support Java 7 in the upstream release for some
larger deployments. I like the idea of deprecating or at least strongly
encouraging people to update.

On Tuesday, April 5, 2016, Sean Owen  wrote:

> Following
> https://github.com/apache/spark/pull/12165#issuecomment-205791222
> I'd like to make a point about process and then answer points below.
>
> We have this funny system where anyone can propose a change, and any
> of a few people can veto a change unilaterally. The latter rarely
> comes up. 9 changes out of 10 nobody disagrees on; sometimes a
> committer will say 'no' to a change and nobody else with that bit
> disagrees.
>
> Sometimes it matters and here I see, what, 4 out of 5 people including
> committers supporting a particular change. A veto to oppose that is
> pretty drastic. It's not something to use because you or customers
> prefer a certain outcome. This reads like you're informing people
> you've changed your mind and that's the decision, when it can't work
> that way. I saw this happen to a lesser extent in the thread about
> Scala 2.10.
>
> It doesn't mean majority rules here either, but can I suggest you
> instead counter-propose an outcome that the people here voting in
> favor of what you're vetoing would probably also buy into? I bet
> everyone's willing to give wide accommodation to your concerns. It's
> probably not hard, like: let's plan to not support Java 7 in Spark
> 2.1.0. (Then we can debate the logic of that.)
>
> On Mon, Apr 4, 2016 at 6:28 AM, Reynold Xin  > wrote:
> > some technology companies, are still using Java 7. One thing is that up
> > until this date, users still can't install openjdk 8 on Ubuntu by
> default. I
> > see that as an indication that it is too early to drop Java 7.
>
> I have Java 8 on my Ubuntu instance, and installed it directly via apt-get.
> http://openjdk.java.net/install/
>
>
> > Looking at the timeline, JDK release a major new version roughly every 3
> > years. We dropped Java 6 support one year ago, so from a timeline point
> of
> > view we would be very aggressive here if we were to drop Java 7 support
> in
> > Spark 2.0.
>
> The metric is really (IMHO) when the JDK goes EOL. Java 6 was EOL in
> Feb 2013, so supporting it into Spark 1.x was probably too long. Java
> 7 was EOL in April 2015. It's not really somehow every ~3 years.
>
>
> > Note that not dropping Java 7 support now doesn't mean we have to support
> > Java 7 throughout Spark 2.x. We dropped Java 6 support in Spark 1.5, even
> > though Spark 1.0 started with Java 6.
>
> Whatever arguments one has about preventing people from updating to
> the latest and greatest then apply to a *minor* release, which is
> worse. Java 6 support was probably overdue for removal at 1.0;
> better-late-than-never, not necessarily the right time to do it.
>
>
> > In terms of testing, Josh has actually improved our test infra so now we
> > would run the Java 8 tests: https://github.com/apache/spark/pull/12073
>
> Excellent, but, orthogonal.
>
> Even if I personally don't see the merit in these arguments compared
> to the counter-arguments, retaining Java 7 support now wouldn't be a
> terrible outcome. I'd like to see better process and a more reasonable
> compromise result though.
>


RE: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Raymond Honderdors
Here is the error after building with Scala 2.10:
“
Spark Command: /usr/lib/jvm/java-1.8.0/bin/java -cp 
/home/raymond.honderdors/Documents/IdeaProjects/spark/conf/:/home/raymond.honderdors/Documents/IdeaProjects/spark/assembly/target/scala-2.10/jars/*
 -Xms5g -Xmx5g org.apache.spark.deploy.SparkSubmit --class 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal

Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing 
class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:710)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
”



Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya

From: Raymond Honderdors [mailto:raymond.honderd...@sizmek.com]
Sent: Tuesday, April 05, 2016 4:23 PM
To: Reynold Xin 
Cc: dev@spark.apache.org
Subject: RE: Build with Thrift Server & Scala 2.11

I can see that the build is successful
(-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
–Dscala-2.11 -DskipTests clean package)

the documentation page still says that
“
Building With Hive and JDBC Support
To enable Hive integration for Spark SQL along with its JDBC server and CLI, 
add the -Phive and Phive-thriftserver profiles to your existing build options. 
By default Spark will build with Hive 0.13.1 bindings.

# Apache Hadoop 2.4.X with Hive 13 support
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver 
-DskipTests clean package
Building for Scala 2.11
To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11 
property:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
Spark does not yet support its JDBC component for Scala 2.11.
”
Source : http://spark.apache.org/docs/latest/building-spark.html

When I try to start the thrift server I get the following error:
“
16/04/05 16:09:11 INFO BlockManagerMaster: Registered BlockManager
16/04/05 16:09:12 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.UnknownHostException: namenode
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at 
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at 
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at 
org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1667)
at 

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-04-05 Thread Sean Owen
Following https://github.com/apache/spark/pull/12165#issuecomment-205791222
I'd like to make a point about process and then answer points below.

We have this funny system where anyone can propose a change, and any
of a few people can veto a change unilaterally. The latter rarely
comes up. 9 changes out of 10 nobody disagrees on; sometimes a
committer will say 'no' to a change and nobody else with that bit
disagrees.

Sometimes it matters and here I see, what, 4 out of 5 people including
committers supporting a particular change. A veto to oppose that is
pretty drastic. It's not something to use because you or customers
prefer a certain outcome. This reads like you're informing people
you've changed your mind and that's the decision, when it can't work
that way. I saw this happen to a lesser extent in the thread about
Scala 2.10.

It doesn't mean majority rules here either, but can I suggest you
instead counter-propose an outcome that the people here voting in
favor of what you're vetoing would probably also buy into? I bet
everyone's willing to give wide accommodation to your concerns. It's
probably not hard, like: let's plan to not support Java 7 in Spark
2.1.0. (Then we can debate the logic of that.)

On Mon, Apr 4, 2016 at 6:28 AM, Reynold Xin  wrote:
> some technology companies, are still using Java 7. One thing is that up
> until this date, users still can't install openjdk 8 on Ubuntu by default. I
> see that as an indication that it is too early to drop Java 7.

I have Java 8 on my Ubuntu instance, and installed it directly via apt-get.
http://openjdk.java.net/install/


> Looking at the timeline, JDK release a major new version roughly every 3
> years. We dropped Java 6 support one year ago, so from a timeline point of
> view we would be very aggressive here if we were to drop Java 7 support in
> Spark 2.0.

The metric is really (IMHO) when the JDK goes EOL. Java 6 was EOL in
Feb 2013, so supporting it into Spark 1.x was probably too long. Java
7 was EOL in April 2015. It's not really somehow every ~3 years.


> Note that not dropping Java 7 support now doesn't mean we have to support
> Java 7 throughout Spark 2.x. We dropped Java 6 support in Spark 1.5, even
> though Spark 1.0 started with Java 6.

Whatever arguments one has about preventing people from updating to
the latest and greatest then apply to a *minor* release, which is
worse. Java 6 support was probably overdue for removal at 1.0;
better-late-than-never, not necessarily the right time to do it.


> In terms of testing, Josh has actually improved our test infra so now we
> would run the Java 8 tests: https://github.com/apache/spark/pull/12073

Excellent, but, orthogonal.

Even if I personally don't see the merit in these arguments compared
to the counter-arguments, retaining Java 7 support now wouldn't be a
terrible outcome. I'd like to see better process and a more reasonable
compromise result though.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Raymond Honderdors
I can see that the build is successful
(-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
–Dscala-2.11 -DskipTests clean package)

the documentation page still says that
“
Building With Hive and JDBC Support
To enable Hive integration for Spark SQL along with its JDBC server and CLI, 
add the -Phive and Phive-thriftserver profiles to your existing build options. 
By default Spark will build with Hive 0.13.1 bindings.

# Apache Hadoop 2.4.X with Hive 13 support
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver 
-DskipTests clean package
Building for Scala 2.11
To produce a Spark package compiled with Scala 2.11, use the -Dscala-2.11 
property:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
Spark does not yet support its JDBC component for Scala 2.11.
”
Source : http://spark.apache.org/docs/latest/building-spark.html

When I try to start the thrift server I get the following error:
“
16/04/05 16:09:11 INFO BlockManagerMaster: Registered BlockManager
16/04/05 16:09:12 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.UnknownHostException: namenode
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at 
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at 
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at 
org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1667)
at 
org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:67)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:517)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:57)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:77)
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:726)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: namenode
... 26 more
16/04/05 16:09:12 INFO SparkUI: Stopped Spark web UI at 
http://10.10.182.195:4040
16/04/05 16:09:12 INFO SparkDeploySchedulerBackend: Shutting down all executors
”



Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, April 05, 2016 3:57 PM
To: Raymond Honderdors 
Cc: dev@spark.apache.org
Subject: Re: Build with Thrift Server & Scala 2.11

What do you mean? The Jenkins build for Spark uses 2.11 and also builds the 
thrift server.

On Tuesday, April 5, 2016, Raymond Honderdors 
> wrote:
Is anyone looking into this one, Build with Thrift Server & Scala 2.11?
If so, when can we expect it?

Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya




spark-raymond.honderdors-org.apache.spark.deploy.master.Master-1-CENTOS-RND-DEV.eyeblaster.com.out
Description: 

Re: Build with Thrift Server & Scala 2.11

2016-04-05 Thread Reynold Xin
What do you mean? The Jenkins build for Spark uses 2.11 and also builds the
thrift server.

On Tuesday, April 5, 2016, Raymond Honderdors 
wrote:

> Is anyone looking into this one, Build with Thrift Server & Scala 2.11?
>
> If so, when can we expect it?
>
>
>
> *Raymond Honderdors *
>
> *Team Lead Analytics BI*
>
> *Business Intelligence Developer *
>
> *raymond.honderd...@sizmek.com
>  *
>
> *T +972.7325.3569*
>
> *Herzliya*
>
>
>
>
> 
>


Build with Thrift Server & Scala 2.11

2016-04-05 Thread Raymond Honderdors
Is anyone looking into this one, Build with Thrift Server & Scala 2.11?
If so, when can we expect it?

Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya

