Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
Thank you, Sean!

On Mon, May 23, 2016 at 2:09 PM, Sean Owen  wrote:

> No, because then none of the Java 8 support can build. Marcelo has a JIRA
> for handling that the right way with bootstrap class path config.
>
> Ideally it can be rolled into Jenkins though there are possibly historical
> reasons it was not enabled before. Best to fix those if possible but if not
> I'd rather have some automated checking than none. Checking lint is
> reasonably important.
>
> On Mon, May 23, 2016, 15:58 Ted Yu  wrote:
>
>> For #1 below, currently Jenkins uses Java 8:
>>
>> JAVA_HOME=/usr/java/jdk1.8.0_60
>>
>>
>> How about switching to Java 7 ?
>>
>>
>> My two cents.
>>
>>
>>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Sean Owen
No, because then none of the Java 8 support can build. Marcelo has a JIRA
for handling that the right way with bootstrap class path config.

Ideally it can be rolled into Jenkins though there are possibly historical
reasons it was not enabled before. Best to fix those if possible but if not
I'd rather have some automated checking than none. Checking lint is
reasonably important.

On Mon, May 23, 2016, 15:58 Ted Yu  wrote:

> For #1 below, currently Jenkins uses Java 8:
>
> JAVA_HOME=/usr/java/jdk1.8.0_60
>
>
> How about switching to Java 7 ?
>
>
> My two cents.
>
>
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Ted Yu
For #1 below, currently Jenkins uses Java 8:

JAVA_HOME=/usr/java/jdk1.8.0_60


How about switching to Java 7?


My two cents.


On Mon, May 23, 2016 at 1:24 PM, Dongjoon Hyun  wrote:

> Thank you for your opinion!
>
> Sure. I know that history and totally agree with all your concerns.
> I indeed has hesitated about sending this kind of suggestion for a while.
>
> If Travis CI cannot handle those simple jobs at this time again,
> we must turn off from Spark PR queue.
> We can see the result quickly in one or two days.
> To turn on/off, Spark have nothing to do. INFRA team will do that.
>
> In fact, the goal is not about using another CI (like Travis), it is about
> preventing the followings.
>
> 1. JDK7 compilation errors. (Recently, 2 days ago and 5 days ago)
> 2. Java static errors. (Not critical but more frequently.)
> 3. Maven installation errors. (A month ago, it's reported in this mailing
> list.)
>
> Scala 2.10 compilation errors are fixed nearly instantly. But, 1~3 were
> not.
> If SparkPullRequestBuilder can do the above 1~3, that's the best for us.
> Do you think it is possible in some ways?
>
> By the way, as of today, Spark has 724 Java files and 96762 lines (without
> comment/blank).
> It's about 1/3 of Scala code. It's not small.
> -------------------------------------------------
> Language     files     blank    comment      code
> -------------------------------------------------
> Scala         2368     63578     124904    322518
> Java           724     18569      23445     96762
>
> Dongjoon.
>
>
>
> On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust  > wrote:
>
>> We did turn on travis a few years ago, but ended up turning it off
>> because it was failing (I believe because of insufficient resources) which
>> was confusing for developers.  I wouldn't be opposed to turning it on if it
>> provides more/faster signal, but its not obvious to me that it would.  In
>> particular, do we know that given the rate PRs are created if we will hit
>> rate limits?
>>
>> Really my main feedback is, if the java linter is important we should
>> probably have it as part of the canonical build process.  I worry about
>> having more than one set of CI infrastructure to maintain.
>>
>> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Steve and Hyukjin.
>>>
>>> And, don't worry, Ted.
>>>
>>> Travis launches new VMs for every PR.
>>>
>>> Apache Spark repository uses the following setting.
>>>
>>> VM: Google Compute Engine
>>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
>>> CPU: ~2 CORE
>>> RAM: 7.5GB
>>>
>>> FYI, you can find more information about this here.
>>>
>>>
>>> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> On Mon, May 23, 2016 at 6:32 AM, Ted Yu  wrote:
>>>
 Do you know if more than one PR would be verified on the same machine ?

 I wonder whether the 'mvn install' from two simultaneous PR builds may
 have conflict.

 On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun 
 wrote:

> Thank you for feedback. Sure, correctly, that's the reason why the
> current SparkPullRequestBuilder do not run `lint-java`. :-)
>
> In addition, that's the same reason why contributors are reluctant to
> run `lint-java` and causes breaking on JDK7 builds.
>
> Such a tedious and time-consuming job should be done by CI without
> human interventions.
>
> By the way, why do you think we need to wait for that? We should not
> wait for any CIs, we should continue our own work.
>
> My proposal isn't for making you wait to watch the result. There are
> two use cases I want for us to focus here.
>
> Case 1: When you make a PR to Spark PR queue.
>
> Travis CI will finish before SparkPullRequestBuilder.
> We will run the followings in parallel mode.
>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no
> Java Linter)
>  2. Travis: JDK7 + mvn build + Java Linter
>  3. Travis: JDK8 + mvn build + Java Linter
>  As we know, 1 is the longest time-consuming one which have lots
> of works (except maven building or lint-  java). You don't need to wait
> more in many cases. Yes, in many cases, not all the cases.
>
>
> Case 2: When you prepare a PR on your branch.
>
> If you are at the final commit (maybe already-squashed), just go
> to case 1.
>
> However, usually, we makes lots of commits locally while making
> preparing our PR.
> And, finally we squashed them into one and send a PR to Spark.
> I mean you can use Travis CI during preparing your PRs.
> Again, don't wait for Travis CI. 

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
I want to clarify something here.

For Travis CI, it's free for open source projects, and there is only one
management point, `.travis.yml`, for the Spark community.

It's not something like a physical Jenkins cluster farm; it's just a cloud
service, like GitHub.

PS.
I'm also not an employee of Travis (or GitHub). :-)
If Spark uses Travis CI a lot for free, they might dislike me for the heavy traffic.


On Mon, May 23, 2016 at 1:26 PM, Dongjoon Hyun  wrote:

> Thank you, Shane!
>
> I really hope that SparkPullRequestBuilder handle them if possible.
>
> Dongjoon.
>
> On Mon, May 23, 2016 at 1:24 PM, Dongjoon Hyun 
> wrote:
>
>> Thank you for your opinion!
>>
>> Sure. I know that history and totally agree with all your concerns.
>> I indeed has hesitated about sending this kind of suggestion for a while.
>>
>> If Travis CI cannot handle those simple jobs at this time again,
>> we must turn off from Spark PR queue.
>> We can see the result quickly in one or two days.
>> To turn on/off, Spark have nothing to do. INFRA team will do that.
>>
>> In fact, the goal is not about using another CI (like Travis), it is
>> about preventing the followings.
>>
>> 1. JDK7 compilation errors. (Recently, 2 days ago and 5 days ago)
>> 2. Java static errors. (Not critical but more frequently.)
>> 3. Maven installation errors. (A month ago, it's reported in this mailing
>> list.)
>>
>> Scala 2.10 compilation errors are fixed nearly instantly. But, 1~3 were
>> not.
>> If SparkPullRequestBuilder can do the above 1~3, that's the best for us.
>> Do you think it is possible in some ways?
>>
>> By the way, as of today, Spark has 724 Java files and 96762 lines
>> (without comment/blank).
>> It's about 1/3 of Scala code. It's not small.
>> -------------------------------------------------
>> Language     files     blank    comment      code
>> -------------------------------------------------
>> Scala         2368     63578     124904    322518
>> Java           724     18569      23445     96762
>>
>> Dongjoon.
>>
>>
>>
>> On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> We did turn on travis a few years ago, but ended up turning it off
>>> because it was failing (I believe because of insufficient resources) which
>>> was confusing for developers.  I wouldn't be opposed to turning it on if it
>>> provides more/faster signal, but its not obvious to me that it would.  In
>>> particular, do we know that given the rate PRs are created if we will hit
>>> rate limits?
>>>
>>> Really my main feedback is, if the java linter is important we should
>>> probably have it as part of the canonical build process.  I worry about
>>> having more than one set of CI infrastructure to maintain.
>>>
>>> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun 
>>> wrote:
>>>
 Thank you, Steve and Hyukjin.

 And, don't worry, Ted.

 Travis launches new VMs for every PR.

 Apache Spark repository uses the following setting.

 VM: Google Compute Engine
 OS: Ubuntu 14.04.3 LTS Server Edition 64bit
 CPU: ~2 CORE
 RAM: 7.5GB

 FYI, you can find more information about this here.


 https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments

 Dongjoon.



 On Mon, May 23, 2016 at 6:32 AM, Ted Yu  wrote:

> Do you know if more than one PR would be verified on the same machine ?
>
> I wonder whether the 'mvn install' from two simultaneous PR builds may
> have conflict.
>
> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun 
> wrote:
>
>> Thank you for feedback. Sure, correctly, that's the reason why the
>> current SparkPullRequestBuilder do not run `lint-java`. :-)
>>
>> In addition, that's the same reason why contributors are reluctant to
>> run `lint-java` and causes breaking on JDK7 builds.
>>
>> Such a tedious and time-consuming job should be done by CI without
>> human interventions.
>>
>> By the way, why do you think we need to wait for that? We should not
>> wait for any CIs, we should continue our own work.
>>
>> My proposal isn't for making you wait to watch the result. There are
>> two use cases I want for us to focus here.
>>
>> Case 1: When you make a PR to Spark PR queue.
>>
>> Travis CI will finish before SparkPullRequestBuilder.
>> We will run the followings in parallel mode.
>>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no
>> Java Linter)
>>  2. Travis: JDK7 + mvn build + Java Linter
>>  3. Travis: JDK8 + mvn build + Java Linter
>>  As we know, 1 is the longest time-consuming one which have lots
>> of works (except maven 

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
Thank you, Shane!

I really hope that SparkPullRequestBuilder can handle them if possible.

Dongjoon.

On Mon, May 23, 2016 at 1:24 PM, Dongjoon Hyun  wrote:

> Thank you for your opinion!
>
> Sure. I know that history and totally agree with all your concerns.
> I indeed has hesitated about sending this kind of suggestion for a while.
>
> If Travis CI cannot handle those simple jobs at this time again,
> we must turn off from Spark PR queue.
> We can see the result quickly in one or two days.
> To turn on/off, Spark have nothing to do. INFRA team will do that.
>
> In fact, the goal is not about using another CI (like Travis), it is about
> preventing the followings.
>
> 1. JDK7 compilation errors. (Recently, 2 days ago and 5 days ago)
> 2. Java static errors. (Not critical but more frequently.)
> 3. Maven installation errors. (A month ago, it's reported in this mailing
> list.)
>
> Scala 2.10 compilation errors are fixed nearly instantly. But, 1~3 were
> not.
> If SparkPullRequestBuilder can do the above 1~3, that's the best for us.
> Do you think it is possible in some ways?
>
> By the way, as of today, Spark has 724 Java files and 96762 lines (without
> comment/blank).
> It's about 1/3 of Scala code. It's not small.
> -------------------------------------------------
> Language     files     blank    comment      code
> -------------------------------------------------
> Scala         2368     63578     124904    322518
> Java           724     18569      23445     96762
>
> Dongjoon.
>
>
>
> On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust  > wrote:
>
>> We did turn on travis a few years ago, but ended up turning it off
>> because it was failing (I believe because of insufficient resources) which
>> was confusing for developers.  I wouldn't be opposed to turning it on if it
>> provides more/faster signal, but its not obvious to me that it would.  In
>> particular, do we know that given the rate PRs are created if we will hit
>> rate limits?
>>
>> Really my main feedback is, if the java linter is important we should
>> probably have it as part of the canonical build process.  I worry about
>> having more than one set of CI infrastructure to maintain.
>>
>> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Steve and Hyukjin.
>>>
>>> And, don't worry, Ted.
>>>
>>> Travis launches new VMs for every PR.
>>>
>>> Apache Spark repository uses the following setting.
>>>
>>> VM: Google Compute Engine
>>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
>>> CPU: ~2 CORE
>>> RAM: 7.5GB
>>>
>>> FYI, you can find more information about this here.
>>>
>>>
>>> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> On Mon, May 23, 2016 at 6:32 AM, Ted Yu  wrote:
>>>
 Do you know if more than one PR would be verified on the same machine ?

 I wonder whether the 'mvn install' from two simultaneous PR builds may
 have conflict.

 On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun 
 wrote:

> Thank you for feedback. Sure, correctly, that's the reason why the
> current SparkPullRequestBuilder do not run `lint-java`. :-)
>
> In addition, that's the same reason why contributors are reluctant to
> run `lint-java` and causes breaking on JDK7 builds.
>
> Such a tedious and time-consuming job should be done by CI without
> human interventions.
>
> By the way, why do you think we need to wait for that? We should not
> wait for any CIs, we should continue our own work.
>
> My proposal isn't for making you wait to watch the result. There are
> two use cases I want for us to focus here.
>
> Case 1: When you make a PR to Spark PR queue.
>
> Travis CI will finish before SparkPullRequestBuilder.
> We will run the followings in parallel mode.
>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no
> Java Linter)
>  2. Travis: JDK7 + mvn build + Java Linter
>  3. Travis: JDK8 + mvn build + Java Linter
>  As we know, 1 is the longest time-consuming one which have lots
> of works (except maven building or lint-  java). You don't need to wait
> more in many cases. Yes, in many cases, not all the cases.
>
>
> Case 2: When you prepare a PR on your branch.
>
> If you are at the final commit (maybe already-squashed), just go
> to case 1.
>
> However, usually, we makes lots of commits locally while making
> preparing our PR.
> And, finally we squashed them into one and send a PR to Spark.
> I mean you can use Travis CI during preparing your PRs.
> Again, don't wait for Travis CI. Just push it sometime or at 

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
 Thank you for your opinion!

Sure. I know that history and totally agree with all your concerns.
I have indeed hesitated about sending this kind of suggestion for a while.

If Travis CI cannot handle those simple jobs this time either,
we can simply turn it off for the Spark PR queue.
We can see the result quickly, in one or two days.
To turn it on or off, Spark has nothing to do; the INFRA team will do that.

In fact, the goal is not about using another CI (like Travis); it is about
preventing the following:

1. JDK7 compilation errors. (Recently, 2 days ago and 5 days ago.)
2. Java static analysis errors. (Not critical, but more frequent.)
3. Maven installation errors. (Reported on this mailing list a month ago.)

Scala 2.10 compilation errors are fixed nearly instantly, but 1-3 were not.
If SparkPullRequestBuilder can do 1-3 above, that's the best for us.
Do you think that is possible in some way?

By the way, as of today, Spark has 724 Java files and 96,762 lines of Java
code (excluding comments and blanks).
That's about 1/3 of the Scala code. It's not small.
-------------------------------------------------
Language     files     blank    comment      code
-------------------------------------------------
Scala         2368     63578     124904    322518
Java           724     18569      23445     96762

Dongjoon.



On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust 
wrote:

> We did turn on travis a few years ago, but ended up turning it off because
> it was failing (I believe because of insufficient resources) which was
> confusing for developers.  I wouldn't be opposed to turning it on if it
> provides more/faster signal, but its not obvious to me that it would.  In
> particular, do we know that given the rate PRs are created if we will hit
> rate limits?
>
> Really my main feedback is, if the java linter is important we should
> probably have it as part of the canonical build process.  I worry about
> having more than one set of CI infrastructure to maintain.
>
> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun 
> wrote:
>
>> Thank you, Steve and Hyukjin.
>>
>> And, don't worry, Ted.
>>
>> Travis launches new VMs for every PR.
>>
>> Apache Spark repository uses the following setting.
>>
>> VM: Google Compute Engine
>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
>> CPU: ~2 CORE
>> RAM: 7.5GB
>>
>> FYI, you can find more information about this here.
>>
>>
>> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>>
>> Dongjoon.
>>
>>
>>
>> On Mon, May 23, 2016 at 6:32 AM, Ted Yu  wrote:
>>
>>> Do you know if more than one PR would be verified on the same machine ?
>>>
>>> I wonder whether the 'mvn install' from two simultaneous PR builds may
>>> have conflict.
>>>
>>> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun 
>>> wrote:
>>>
 Thank you for feedback. Sure, correctly, that's the reason why the
 current SparkPullRequestBuilder do not run `lint-java`. :-)

 In addition, that's the same reason why contributors are reluctant to
 run `lint-java` and causes breaking on JDK7 builds.

 Such a tedious and time-consuming job should be done by CI without
 human interventions.

 By the way, why do you think we need to wait for that? We should not
 wait for any CIs, we should continue our own work.

 My proposal isn't for making you wait to watch the result. There are
 two use cases I want for us to focus here.

 Case 1: When you make a PR to Spark PR queue.

 Travis CI will finish before SparkPullRequestBuilder.
 We will run the followings in parallel mode.
  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no
 Java Linter)
  2. Travis: JDK7 + mvn build + Java Linter
  3. Travis: JDK8 + mvn build + Java Linter
  As we know, 1 is the longest time-consuming one which have lots of
 works (except maven building or lint-  java). You don't need to wait more
 in many cases. Yes, in many cases, not all the cases.


 Case 2: When you prepare a PR on your branch.

 If you are at the final commit (maybe already-squashed), just go to
 case 1.

 However, usually, we makes lots of commits locally while making
 preparing our PR.
 And, finally we squashed them into one and send a PR to Spark.
 I mean you can use Travis CI during preparing your PRs.
 Again, don't wait for Travis CI. Just push it sometime or at every
 commit, and continue your work.

 At the final stage when you finish your coding, squash your commits
 into one,
 and amend your commit title or messages, see the Travis CI.
 Or, you can monitor Travis CI result on status menu bar.
 If it shows green icon, you have nothing to do.

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread shane knapp
chiming in, as i'm the one who currently maintains the CI infrastructure...  :)

+1 on not having more than one CI system...  there's no way i can
commit to keeping an eye on anything else other than jenkins.

and i agree wholeheartedly w/michael:  if it's this important, let's
add it to the jenkins builds.

On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust
 wrote:
> We did turn on travis a few years ago, but ended up turning it off because
> it was failing (I believe because of insufficient resources) which was
> confusing for developers.  I wouldn't be opposed to turning it on if it
> provides more/faster signal, but its not obvious to me that it would.  In
> particular, do we know that given the rate PRs are created if we will hit
> rate limits?
>
> Really my main feedback is, if the java linter is important we should
> probably have it as part of the canonical build process.  I worry about
> having more than one set of CI infrastructure to maintain.
>
> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun  wrote:
>>
>> Thank you, Steve and Hyukjin.
>>
>> And, don't worry, Ted.
>>
>> Travis launches new VMs for every PR.
>>
>> Apache Spark repository uses the following setting.
>>
>> VM: Google Compute Engine
>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
>> CPU: ~2 CORE
>> RAM: 7.5GB
>>
>> FYI, you can find more information about this here.
>>
>>
>> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>>
>> Dongjoon.
>>
>>
>>
>> On Mon, May 23, 2016 at 6:32 AM, Ted Yu  wrote:
>>>
>>> Do you know if more than one PR would be verified on the same machine ?
>>>
>>> I wonder whether the 'mvn install' from two simultaneous PR builds may
>>> have conflict.
>>>
>>> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun 
>>> wrote:

 Thank you for feedback. Sure, correctly, that's the reason why the
 current SparkPullRequestBuilder do not run `lint-java`. :-)

 In addition, that's the same reason why contributors are reluctant to
 run `lint-java` and causes breaking on JDK7 builds.

 Such a tedious and time-consuming job should be done by CI without human
 interventions.

 By the way, why do you think we need to wait for that? We should not
 wait for any CIs, we should continue our own work.

 My proposal isn't for making you wait to watch the result. There are two
 use cases I want for us to focus here.

 Case 1: When you make a PR to Spark PR queue.

 Travis CI will finish before SparkPullRequestBuilder.
 We will run the followings in parallel mode.
  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no Java
 Linter)
  2. Travis: JDK7 + mvn build + Java Linter
  3. Travis: JDK8 + mvn build + Java Linter
  As we know, 1 is the longest time-consuming one which have lots of
 works (except maven building or lint-  java). You don't need to wait more 
 in
 many cases. Yes, in many cases, not all the cases.


 Case 2: When you prepare a PR on your branch.

 If you are at the final commit (maybe already-squashed), just go to
 case 1.

 However, usually, we makes lots of commits locally while making
 preparing our PR.
 And, finally we squashed them into one and send a PR to Spark.
 I mean you can use Travis CI during preparing your PRs.
 Again, don't wait for Travis CI. Just push it sometime or at every
 commit, and continue your work.

 At the final stage when you finish your coding, squash your commits
 into one,
 and amend your commit title or messages, see the Travis CI.
 Or, you can monitor Travis CI result on status menu bar.
 If it shows green icon, you have nothing to do.

https://docs.travis-ci.com/user/apps/

 To sum up, I think we don't need to wait for any CIs. It's like an
 email. `Send and back to work.`

 Dongjoon.


 On Sun, May 22, 2016 at 8:32 PM, Ted Yu  wrote:
>
> Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.
>
> Maybe not everyone is willing to wait that long.
>
> On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun 
> wrote:
>>
>> Oh, Sure. My bad!
>>
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>> - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.
>>
>> Thank you, Ted.
>>
>> Dongjoon.
>>
>> On Sun, May 22, 2016 at 1:29 PM, Ted Yu  wrote:
>>>
>>> The following line was repeated twice:
>>>
>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>
>>> Did you intend to cover JDK 8 ?
>>>
>>> Cheers
>>>
>>> On Sun, May 22, 2016 at 1:25 PM, Dongjoon 

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Michael Armbrust
We did turn on Travis a few years ago, but ended up turning it off because
it was failing (I believe because of insufficient resources), which was
confusing for developers.  I wouldn't be opposed to turning it on if it
provides more/faster signal, but it's not obvious to me that it would.  In
particular, do we know whether, given the rate at which PRs are created, we
will hit rate limits?

Really, my main feedback is: if the Java linter is important, we should
probably have it as part of the canonical build process.  I worry about
having more than one set of CI infrastructure to maintain.

On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun  wrote:

> Thank you, Steve and Hyukjin.
>
> And, don't worry, Ted.
>
> Travis launches new VMs for every PR.
>
> Apache Spark repository uses the following setting.
>
> VM: Google Compute Engine
> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
> CPU: ~2 CORE
> RAM: 7.5GB
>
> FYI, you can find more information about this here.
>
> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>
> Dongjoon.
>
>
>
> On Mon, May 23, 2016 at 6:32 AM, Ted Yu  wrote:
>
>> Do you know if more than one PR would be verified on the same machine ?
>>
>> I wonder whether the 'mvn install' from two simultaneous PR builds may
>> have conflict.
>>
>> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for feedback. Sure, correctly, that's the reason why the
>>> current SparkPullRequestBuilder do not run `lint-java`. :-)
>>>
>>> In addition, that's the same reason why contributors are reluctant to
>>> run `lint-java` and causes breaking on JDK7 builds.
>>>
>>> Such a tedious and time-consuming job should be done by CI without human
>>> interventions.
>>>
>>> By the way, why do you think we need to wait for that? We should not
>>> wait for any CIs, we should continue our own work.
>>>
>>> My proposal isn't for making you wait to watch the result. There are two
>>> use cases I want for us to focus here.
>>>
>>> Case 1: When you make a PR to Spark PR queue.
>>>
>>> Travis CI will finish before SparkPullRequestBuilder.
>>> We will run the followings in parallel mode.
>>>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no
>>> Java Linter)
>>>  2. Travis: JDK7 + mvn build + Java Linter
>>>  3. Travis: JDK8 + mvn build + Java Linter
>>>  As we know, 1 is the longest time-consuming one which have lots of
>>> works (except maven building or lint-  java). You don't need to wait more
>>> in many cases. Yes, in many cases, not all the cases.
>>>
>>>
>>> Case 2: When you prepare a PR on your branch.
>>>
>>> If you are at the final commit (maybe already-squashed), just go to
>>> case 1.
>>>
>>> However, usually, we makes lots of commits locally while making
>>> preparing our PR.
>>> And, finally we squashed them into one and send a PR to Spark.
>>> I mean you can use Travis CI during preparing your PRs.
>>> Again, don't wait for Travis CI. Just push it sometime or at every
>>> commit, and continue your work.
>>>
>>> At the final stage when you finish your coding, squash your commits
>>> into one,
>>> and amend your commit title or messages, see the Travis CI.
>>> Or, you can monitor Travis CI result on status menu bar.
>>> If it shows green icon, you have nothing to do.
>>>
>>>https://docs.travis-ci.com/user/apps/
>>>
>>> To sum up, I think we don't need to wait for any CIs. It's like an
>>> email. `Send and back to work.`
>>>
>>> Dongjoon.
>>>
>>>
>>> On Sun, May 22, 2016 at 8:32 PM, Ted Yu  wrote:
>>>
 Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.

 Maybe not everyone is willing to wait that long.

 On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun 
 wrote:

> Oh, Sure. My bad!
>
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
> - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.
>
> Thank you, Ted.
>
> Dongjoon.
>
> On Sun, May 22, 2016 at 1:29 PM, Ted Yu  wrote:
>
>> The following line was repeated twice:
>>
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>
>> Did you intend to cover JDK 8 ?
>>
>> Cheers
>>
>> On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> I want to propose the followings.
>>>
>>> - Turn on Travis CI for Apache Spark PR queue.
>>> - Recommend this for contributors, too
>>>
>>> Currently, Spark provides Travis CI configuration file to help
>>> contributors check Scala/Java style conformance and JDK7/8 compilation
>>> easily during their preparing pull requests. Please note that it's only
>>> about static analysis.
>>>
>>> - For Oracle JDK7, mvn -DskipTests 

Re: spark on kubernetes

2016-05-23 Thread Gurvinder Singh
OK, I created this issue: https://issues.apache.org/jira/browse/SPARK-15487
Please comment on it, and also let me know if anyone wants to collaborate
on implementing it. It's my first contribution to Spark, so it will be
exciting.
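
For anyone who would like to collaborate, below is a very rough sketch of
the idea in plain Scala/JDK classes (the endpoint registry, port, and paths
are hypothetical, and this is not Spark's actual internals): the master
already knows every worker's and application driver's UI address, because
they register with it, so it can forward /proxy/<id>/... requests to the
matching internal URL.

  import java.net.{HttpURLConnection, InetSocketAddress, URL}
  import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

  object MasterUiProxySketch {
    // Hypothetical registry: in Spark, the master already holds these
    // endpoints because workers and application drivers register with it.
    val endpoints = Map(
      "worker-1"          -> "http://10.0.0.11:8081",
      "app-20160523-0001" -> "http://10.0.0.21:4040")

    def main(args: Array[String]): Unit = {
      val server = HttpServer.create(new InetSocketAddress(8080), 0)
      server.createContext("/proxy/", new HttpHandler {
        def handle(exchange: HttpExchange): Unit = {
          // /proxy/<id>/<path> -> forward to the endpoint registered for <id>
          val path = exchange.getRequestURI.getPath.stripPrefix("/proxy/")
          val (id, rest) = path.span(_ != '/')
          endpoints.get(id) match {
            case Some(base) =>
              val conn = new URL(base + rest).openConnection()
                .asInstanceOf[HttpURLConnection]
              val body = scala.io.Source.fromInputStream(conn.getInputStream)
                .mkString.getBytes("UTF-8")
              // a real implementation would also rewrite links in the HTML
              exchange.sendResponseHeaders(conn.getResponseCode, body.length.toLong)
              exchange.getResponseBody.write(body)
            case None =>
              exchange.sendResponseHeaders(404, -1L)
          }
          exchange.close()
        }
      })
      server.start() // only this port would need to be exposed to the internet
    }
  }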

- Gurvinder
On 05/23/2016 07:55 PM, Gurvinder Singh wrote:
> On 05/23/2016 07:18 PM, Radoslaw Gruchalski wrote:
>> Sounds surprisingly close to this:
>> https://github.com/apache/spark/pull/9608
>>
> I might have overlooked it but bridge mode work appears to make Spark
> work with docker containers and able to communicate with them when
> running on more than one machines.
> 
> Here I am trying to enable getting information from the Spark UI
> irrespective of whether Spark is running in containers or not. The Spark
> UI's links to workers and application drivers point to an
> internal/protected network, so to get this information from the user's
> machine, he/she has to connect to a VPN. Therefore the proposal is to make
> the Spark master UI reverse-proxy this information back to the user. Then
> only the Spark master UI needs to be opened up to the internet, and there
> is no need to change anything else about how Spark runs internally,
> whether in standalone mode, on Mesos, or in containers on Kubernetes.
> 
> - Gurvinder
>> I can ressurect the work on the bridge mode for Spark 2. The reason why
>> the work on the old one was suspended was because Spark was going
>> through so many changes at that time that a lot of work done, was wiped
>> out by the changes towards 2.0.
>>
>> I know that Lightbend was also interested in having bridge mode.
>>
>> –
>> Best regards,

>> Radek Gruchalski
>> 
ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski
>>
>> *Confidentiality:
>> *This communication is intended for the above-named person and may be
>> confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>>
>> On May 23, 2016 at 7:14:51 PM, Timothy Chen (tnac...@gmail.com
>> ) wrote:
>>
>>> This will also simplify Mesos users as well, DCOS has to work around
>>> this with our own proxying.
>>>
>>> Tim
>>>
>>> On Sun, May 22, 2016 at 11:53 PM, Gurvinder Singh
>>>  wrote:
 Hi Reynold,

 So if that's OK with you, can I go ahead and create JIRA for this. As it
 seems this feature is missing currently and can benefit not just for
 kubernetes users but in general Spark standalone mode users too.

 - Gurvinder
 On 05/22/2016 12:49 PM, Gurvinder Singh wrote:
> On 05/22/2016 10:23 AM, Sun Rui wrote:
>> If it is possible to rewrite URL in outbound responses in Knox or other 
>> reverse proxy, would that solve your issue?
> Any process which can keep track of workers and application drivers IP
> addresses and route traffic to those will work. Considering Spark Master
> does exactly this due to all workers and application has to register to
> the master, therefore I propose master to be the place to add such a
> functionality.
>
> I am not aware with Knox capabilities but Nginx or any other normal
> reverse proxy will not be able to this on its own due to dynamic nature
> of application drivers and to some extent workers too.
>
> - Gurvinder
>>> On May 22, 2016, at 14:55, Gurvinder Singh  
>>> wrote:
>>>
>>> On 05/22/2016 08:32 AM, Reynold Xin wrote:
 Kubernetes itself already has facilities for http proxy, doesn't it?

>>> Yeah kubernetes has ingress controller which can act the L7 load
>>> balancer and router traffic to Spark UI in this case. But I am referring
>>> to link present in UI to worker and application UI. Replied in the
>>> detail to Sun Rui's mail where I gave example of possible scenario.
>>>
>>> - Gurvinder

 On Sat, May 21, 2016 at 9:30 AM, Gurvinder Singh
 > wrote:

Hi,

I am currently working on deploying Spark on kuberentes (K8s) and 
 it is
working fine. I am running Spark with standalone mode and 
 checkpointing
the state to shared system. So if master fails K8s starts it and 
 from
checkpoint it recover the earlier state and things just works fine. 
 I
have an issue with the Spark master Web UI to access the worker and
application UI links. In brief, kubernetes service model allows me 
 to
expose the master service to internet, but accessing the
application/workers UI is not possible as then I have to expose 
 them too
individually and given I can have multiple application it becomes 
 hard
to manage.

One solution 

Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Holden Karau
+1 non-binding (as a contributor, anything which speeds things up is worth a
try, and git blame is a good enough substitute for the list when figuring
out who to ping on a PR).

On Monday, May 23, 2016, Imran Rashid  wrote:

> +1 (binding)
>
> On Mon, May 23, 2016 at 8:13 AM, Tom Graves  > wrote:
>
>> +1 (binding)
>>
>> Tom
>>
>>
>> On Sunday, May 22, 2016 7:34 PM, Matei Zaharia > > wrote:
>>
>>
>> It looks like the discussion thread on this has only had positive
>> replies, so I'm going to call a VOTE. The proposal is to remove the
>> maintainer process in 
>> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
>> <
>> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers>
>> given that it doesn't seem to have had a huge impact on the project, and it
>> can unnecessarily create friction in contributing. We already have +1s from
>> Mridul, Tom, Andrew Or and Imran on that thread.
>>
>> I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>>
>>
>>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: spark on kubernetes

2016-05-23 Thread Radoslaw Gruchalski
Sounds surprisingly close to this:
https://github.com/apache/spark/pull/9608

I can resurrect the work on the bridge mode for Spark 2. The reason the work
on the old one was suspended was that Spark was going through so many changes
at that time that a lot of the work done was wiped out by the changes
towards 2.0.

I know that Lightbend was also interested in having bridge mode.
–  
Best regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski

Confidentiality:
This communication is intended for the above-named person and may be 
confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must 
you copy or show it to anyone; please delete/destroy and inform the sender 
immediately.

On May 23, 2016 at 7:14:51 PM, Timothy Chen (tnac...@gmail.com) wrote:

This will also simplify Mesos users as well, DCOS has to work around  
this with our own proxying.  

Tim  

On Sun, May 22, 2016 at 11:53 PM, Gurvinder Singh  
 wrote:  
> Hi Reynold,  
>  
> So if that's OK with you, can I go ahead and create JIRA for this. As it  
> seems this feature is missing currently and can benefit not just for  
> kubernetes users but in general Spark standalone mode users too.  
>  
> - Gurvinder  
> On 05/22/2016 12:49 PM, Gurvinder Singh wrote:  
>> On 05/22/2016 10:23 AM, Sun Rui wrote:  
>>> If it is possible to rewrite URL in outbound responses in Knox or other 
>>> reverse proxy, would that solve your issue?  
>> Any process which can keep track of workers and application drivers IP  
>> addresses and route traffic to those will work. Considering Spark Master  
>> does exactly this due to all workers and application has to register to  
>> the master, therefore I propose master to be the place to add such a  
>> functionality.  
>>  
>> I am not aware with Knox capabilities but Nginx or any other normal  
>> reverse proxy will not be able to this on its own due to dynamic nature  
>> of application drivers and to some extent workers too.  
>>  
>> - Gurvinder  
 On May 22, 2016, at 14:55, Gurvinder Singh  
 wrote:  
  
 On 05/22/2016 08:32 AM, Reynold Xin wrote:  
> Kubernetes itself already has facilities for http proxy, doesn't it?  
>  
 Yeah kubernetes has ingress controller which can act the L7 load  
 balancer and router traffic to Spark UI in this case. But I am referring  
 to link present in UI to worker and application UI. Replied in the  
 detail to Sun Rui's mail where I gave example of possible scenario.  
  
 - Gurvinder  
>  
> On Sat, May 21, 2016 at 9:30 AM, Gurvinder Singh  
> > wrote:  
>  
> Hi,  
>  
> I am currently working on deploying Spark on kuberentes (K8s) and it is  
> working fine. I am running Spark with standalone mode and checkpointing  
> the state to shared system. So if master fails K8s starts it and from  
> checkpoint it recover the earlier state and things just works fine. I  
> have an issue with the Spark master Web UI to access the worker and  
> application UI links. In brief, kubernetes service model allows me to  
> expose the master service to internet, but accessing the  
> application/workers UI is not possible as then I have to expose them too  
> individually and given I can have multiple application it becomes hard  
> to manage.  
>  
> One solution can be that the master can act as reverse proxy to access  
> information/state/logs from application/workers. As it has the  
> information about their endpoint when application/worker register with  
> master, so when a user initiate a request to access the information,  
> master can proxy the request to corresponding endpoint.  
>  
> So I am wondering if someone has already done work in this direction  
> then it would be great to know. If not then would the community will be  
> interesting in such feature. If yes then how and where I should get  
> started as it would be helpful for me to have some guidance to start  
> working on this.  
>  
> Kind Regards,  
> Gurvinder  
>  
> -  
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org  
>   
> For additional commands, e-mail: dev-h...@spark.apache.org  
>   
>  
>  
  
  
 -  
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org  
 For additional commands, e-mail: dev-h...@spark.apache.org  
  
>>>  
>>>  
>>>  
>>> -  
>>> To unsubscribe, e-mail: 

Re: Running TPCDSQueryBenchmark results in java.lang.OutOfMemoryError

2016-05-23 Thread Ovidiu-Cristian MARCU
Yes,

git log
commit dafcb05c2ef8e09f45edfb7eabf58116c23975a0
Author: Sameer Agarwal 
Date:   Sun May 22 23:32:39 2016 -0700

for #2 see my comments in https://issues.apache.org/jira/browse/SPARK-15078 


> On 23 May 2016, at 18:16, Ted Yu  wrote:
> 
> Can you tell us the commit hash using which the test was run ?
> 
> For #2, if you can give full stack trace, that would be nice.
> 
> Thanks
> 
> On Mon, May 23, 2016 at 8:58 AM, Ovidiu-Cristian MARCU 
> > 
> wrote:
> Hi
> 
> 1) Using latest spark 2.0 I've managed to run TPCDSQueryBenchmark first 9 
> queries and then it ends in the OutOfMemoryError [1].
> 
> What was the configuration used for running this benchmark? Can you explain 
> the meaning of 4 shuffle partitions? Thanks!
> 
> On my local system I use:
> ./bin/spark-submit --class 
> org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --master 
> local[4] jars/spark-sql_2.11-2.0.0-SNAPSHOT-tests.jar
> configured with:
>   .set("spark.sql.parquet.compression.codec", "snappy")
>   .set("spark.sql.shuffle.partitions", "4")
>   .set("spark.driver.memory", "3g")
>   .set("spark.executor.memory", "3g")
>   .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 
> 1024).toString)
> 
> Scale factor of TPCDS is 5, data generated using notes from 
> https://github.com/databricks/spark-sql-perf 
> .
> 
> 2) Running spark-sql-perf with: val experiment = 
> tpcds.runExperiment(tpcds.runnable) on the same dataset reveals some 
> exceptions:
> 
> Running execution q9-v1.4 iteration: 1, StandardRun=true
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
> ...   at 
> org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:289)
>   at 
> org.apache.spark.sql.execution.DeserializeToObject$$anonfun$2.apply(objects.scala:61)
>   at 
> org.apache.spark.sql.execution.DeserializeToObject$$anonfun$2.apply(objects.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
> 
> or
> 
> Running execution q25-v1.4 iteration: 1, StandardRun=true
> java.lang.IllegalStateException: Task -1024 has already locked 
> broadcast_755_piece0 for writing
>   at 
> org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:232)
>   at 
> org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1296)
> 
> Best,
> Ovidiu
> 
> [1]
> Exception in thread "broadcast-exchange-164" java.lang.OutOfMemoryError: Java 
> heap space
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:539)
>   at 
> org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:803)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:816)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:812)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:89)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:71)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:94)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:71)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:71)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 



Re: spark on kubernetes

2016-05-23 Thread Timothy Chen
This will also simplify things for Mesos users as well; DCOS has to work
around this with our own proxying.

Tim

On Sun, May 22, 2016 at 11:53 PM, Gurvinder Singh
 wrote:
> Hi Reynold,
>
> So if that's OK with you, can I go ahead and create JIRA for this. As it
> seems this feature is missing currently and can benefit not just for
> kubernetes users but in general Spark standalone mode users too.
>
> - Gurvinder
> On 05/22/2016 12:49 PM, Gurvinder Singh wrote:
>> On 05/22/2016 10:23 AM, Sun Rui wrote:
>>> If it is possible to rewrite URL in outbound responses in Knox or other 
>>> reverse proxy, would that solve your issue?
>> Any process which can keep track of workers and application drivers IP
>> addresses and route traffic to those will work. Considering Spark Master
>> does exactly this due to all workers and application has to register to
>> the master, therefore I propose master to be the place to add such a
>> functionality.
>>
>> I am not aware with Knox capabilities but Nginx or any other normal
>> reverse proxy will not be able to this on its own due to dynamic nature
>> of application drivers and to some extent workers too.
>>
>> - Gurvinder
 On May 22, 2016, at 14:55, Gurvinder Singh  
 wrote:

 On 05/22/2016 08:32 AM, Reynold Xin wrote:
> Kubernetes itself already has facilities for http proxy, doesn't it?
>
 Yeah kubernetes has ingress controller which can act the L7 load
 balancer and router traffic to Spark UI in this case. But I am referring
 to link present in UI to worker and application UI. Replied in the
 detail to Sun Rui's mail where I gave example of possible scenario.

 - Gurvinder
>
> On Sat, May 21, 2016 at 9:30 AM, Gurvinder Singh
> > wrote:
>
>Hi,
>
>I am currently working on deploying Spark on kuberentes (K8s) and it is
>working fine. I am running Spark with standalone mode and checkpointing
>the state to shared system. So if master fails K8s starts it and from
>checkpoint it recover the earlier state and things just works fine. I
>have an issue with the Spark master Web UI to access the worker and
>application UI links. In brief, kubernetes service model allows me to
>expose the master service to internet, but accessing the
>application/workers UI is not possible as then I have to expose them 
> too
>individually and given I can have multiple application it becomes hard
>to manage.
>
>One solution can be that the master can act as reverse proxy to access
>information/state/logs from application/workers. As it has the
>information about their endpoint when application/worker register with
>master, so when a user initiate a request to access the information,
>master can proxy the request to corresponding endpoint.
>
>So I am wondering if someone has already done work in this direction
>then it would be great to know. If not then would the community will be
>interesting in such feature. If yes then how and where I should get
>started as it would be helpful for me to have some guidance to start
>working on this.
>
>Kind Regards,
>Gurvinder
>
>-
>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>
>For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org

>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



How to map values read from test file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi
I am reading a text file with 16 fields.
All the placeholders for the values of this text file have been defined in,
say, 2 different case classes:
Case1 and Case2

How do I map the values read from the text file, so that my function in
Scala can return 2 different RDDs, one for each of these 2 case class
types?
E.g. the first 11 fields mapped to Case1 while the remaining 6 fields are
mapped to Case2.
Any pointer here or code snippet would be really helpful.
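
For illustration, a minimal sketch of what I have in mind is below (the
field types, the comma delimiter, and the shortened Case1/Case2 definitions
are all assumptions, just to show the shape of the solution):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical case classes standing in for the real 16-field ones.
  case class Case1(f1: String, f2: String, f3: String)    // ...first group of fields
  case class Case2(f12: String, f13: String, f14: String) // ...remaining fields

  object SplitExample {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("split-example").setMaster("local[*]"))

      // Read once, split each line into fields, and cache so that both
      // mappings reuse the parsed records.
      val fields: RDD[Array[String]] =
        sc.textFile("input.txt").map(_.split(",")).cache()

      // Build one RDD per case class from different slices of the same record.
      val rdd1: RDD[Case1] = fields.map(a => Case1(a(0), a(1), a(2)))
      val rdd2: RDD[Case2] = fields.map(a => Case2(a(11), a(12), a(13)))

      // A single function can return both, e.g. as a (RDD[Case1], RDD[Case2]) tuple.
    }
  }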


-- 
Thanks
Deepak


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
Thank you, Steve and Hyukjin.

And, don't worry, Ted.

Travis launches new VMs for every PR.

The Apache Spark repository uses the following settings.

VM: Google Compute Engine
OS: Ubuntu 14.04.3 LTS Server Edition 64bit
CPU: ~2 CORE
RAM: 7.5GB

FYI, you can find more information about this here.

https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments

Dongjoon.



On Mon, May 23, 2016 at 6:32 AM, Ted Yu  wrote:

> Do you know if more than one PR would be verified on the same machine ?
>
> I wonder whether the 'mvn install' from two simultaneous PR builds may
> have conflict.
>
> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun 
> wrote:
>
>> Thank you for the feedback. Sure, correct, that's the reason why the
>> current SparkPullRequestBuilder does not run `lint-java`. :-)
>>
>> In addition, that's the same reason why contributors are reluctant to run
>> `lint-java`, which causes breakage of JDK7 builds.
>>
>> Such a tedious and time-consuming job should be done by CI without human
>> intervention.
>>
>> By the way, why do you think we need to wait for that? We should not wait
>> for any CIs, we should continue our own work.
>>
>> My proposal isn't for making you wait to watch the result. There are two
>> use cases I want for us to focus here.
>>
>> Case 1: When you make a PR to Spark PR queue.
>>
>> Travis CI will finish before SparkPullRequestBuilder.
>> We will run the followings in parallel mode.
>>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no Java
>> Linter)
>>  2. Travis: JDK7 + mvn build + Java Linter
>>  3. Travis: JDK8 + mvn build + Java Linter
>>  As we know, 1 is the most time-consuming one, which has lots of
>> work (other than the maven build or lint-java). You don't need to wait
>> more in many cases. Yes, in many cases, not all cases.
>>
>>
>> Case 2: When you prepare a PR on your branch.
>>
>> If you are at the final commit (maybe already-squashed), just go to
>> case 1.
>>
>> However, usually, we make lots of commits locally while preparing our PR.
>> And, finally, we squash them into one and send a PR to Spark.
>> I mean you can use Travis CI during preparing your PRs.
>> Again, don't wait for Travis CI. Just push it sometime or at every
>> commit, and continue your work.
>>
>> At the final stage when you finish your coding, squash your commits
>> into one,
>> and amend your commit title or messages, see the Travis CI.
>> Or, you can monitor Travis CI result on status menu bar.
>> If it shows green icon, you have nothing to do.
>>
>>https://docs.travis-ci.com/user/apps/
>>
>> To sum up, I think we don't need to wait for any CIs. It's like an email.
>> `Send and back to work.`
>>
>> Dongjoon.
>>
>>
>> On Sun, May 22, 2016 at 8:32 PM, Ted Yu  wrote:
>>
>>> Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.
>>>
>>> Maybe not everyone is willing to wait that long.
>>>
>>> On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun 
>>> wrote:
>>>
 Oh, Sure. My bad!

 - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
 - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.

 Thank you, Ted.

 Dongjoon.

 On Sun, May 22, 2016 at 1:29 PM, Ted Yu  wrote:

> The following line was repeated twice:
>
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>
> Did you intend to cover JDK 8 ?
>
> Cheers
>
> On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> I want to propose the followings.
>>
>> - Turn on Travis CI for Apache Spark PR queue.
>> - Recommend this for contributors, too
>>
>> Currently, Spark provides a Travis CI configuration file to help
>> contributors easily check Scala/Java style conformance and JDK7/8
>> compilation while preparing their pull requests. Please note that it's
>> only about static analysis.
>>
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>> Scalastyle is included in the step 'mvn install', too.
>>
>> Yep, if you turn on your Travis CI configuration, you can already see
>> the results on your branches before making PR. I wrote this email to
>> prevent more failures proactively and community-widely.
>>
>> For stability, I have been monitoring that for two weeks. It detects
>> the failures or recovery on JDK7 builds or Java linter on Spark master
>> branch correctly. The only exceptional case I observed rarely is 
>> `timeout`
>> failure due to hangs of maven. But, as we know, it's happen in our 
>> Jenkins
>> SparkPullRequestBuilder, too. I think we can ignore that.

Re: Running TPCDSQueryBenchmark results in java.lang.OutOfMemoryError

2016-05-23 Thread Ted Yu
Can you tell us the commit hash using which the test was run ?

For #2, if you can give full stack trace, that would be nice.

Thanks

On Mon, May 23, 2016 at 8:58 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Hi
>
> 1) Using latest spark 2.0 I've managed to run TPCDSQueryBenchmark first 9
> queries and then it ends in the OutOfMemoryError [1].
>
> *What was the configuration used for running this benchmark? Can you
> explain the meaning of 4 shuffle partitions? Thanks!*
>
> On my local system I use:
> ./bin/spark-submit --class
> org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --master
> local[4] jars/spark-sql_2.11-2.0.0-SNAPSHOT-tests.jar
> configured with:
>   .set("spark.sql.parquet.compression.codec", "snappy")
>   .set("spark.sql.shuffle.partitions", "4")
>   .set("spark.driver.memory", "3g")
>   .set("spark.executor.memory", "3g")
>   .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024
> ).toString)
>
> Scale factor of TPCDS is 5, data generated using notes from
> https://github.com/databricks/spark-sql-perf.
>
> 2) Running spark-sql-perf with: val experiment =
> tpcds.runExperiment(tpcds.runnable) on the same dataset reveals some
> exceptions:
>
> Running execution *q9-v1.4* iteration: 1, StandardRun=true
> java.lang.NullPointerException
> at
> org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
> at
> org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
> at
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
> at
> org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
> ... at
> org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:289)
> at
> org.apache.spark.sql.execution.DeserializeToObject$$anonfun$2.apply(objects.scala:61)
> at
> org.apache.spark.sql.execution.DeserializeToObject$$anonfun$2.apply(objects.scala:60)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>
> or
>
> Running execution q25-v1.4 iteration: 1, StandardRun=true
> java.lang.IllegalStateException: Task -1024 has already locked
> broadcast_755_piece0 for writing
> at
> org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:232)
> at
> org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1296)
>
> Best,
> Ovidiu
>
> [1]
> Exception in thread "broadcast-exchange-164" java.lang.OutOfMemoryError:
> Java heap space
> at
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:539)
> at
> org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:803)
> at
> org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:105)
> at
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:816)
> at
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:812)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:89)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:71)
> at
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:94)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:71)
> at
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:71)
> at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> at
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>


Running TPCDSQueryBenchmark results in java.lang.OutOfMemoryError

2016-05-23 Thread Ovidiu-Cristian MARCU
Hi

1) Using the latest Spark 2.0 I've managed to run the first 9 TPCDSQueryBenchmark
queries, and then it ends with the OutOfMemoryError [1].

What was the configuration used for running this benchmark? Can you explain the 
meaning of 4 shuffle partitions? Thanks!

On my local system I use:
./bin/spark-submit --class 
org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --master local[4] 
jars/spark-sql_2.11-2.0.0-SNAPSHOT-tests.jar
configured with:
  .set("spark.sql.parquet.compression.codec", "snappy")
  .set("spark.sql.shuffle.partitions", "4")
  .set("spark.driver.memory", "3g")
  .set("spark.executor.memory", "3g")
  .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)

Scale factor of TPCDS is 5, data generated using notes from 
https://github.com/databricks/spark-sql-perf.
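
For reference on the shuffle partitions question above: spark.sql.shuffle.partitions
only controls how many partitions Spark SQL uses on the reduce side of a shuffle
(joins and aggregations); the default is 200. A minimal sketch, assuming a plain
local session, of the setting taking effect:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local[4]")
    .config("spark.sql.shuffle.partitions", "4")   // 4 partitions after every shuffle
    .getOrCreate()
  import spark.implicits._

  // The groupBy forces a shuffle; its output lands in the configured 4 partitions.
  val counts = spark.range(1000).groupBy(($"id" % 10).as("bucket")).count()
  println(counts.rdd.getNumPartitions)             // expected: 4

Whether 4 is a sensible value for scale factor 5 data is a separate question; the
setting simply fixes the shuffle parallelism.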

2) Running spark-sql-perf with: val experiment = 
tpcds.runExperiment(tpcds.runnable) on the same dataset reveals some exceptions:

Running execution q9-v1.4 iteration: 1, StandardRun=true
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ScalarSubquery.dataType(subquery.scala:45)
at 
org.apache.spark.sql.catalyst.expressions.CaseWhenBase.dataType(conditionalExpressions.scala:103)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:165)
at 
org.apache.spark.sql.execution.ProjectExec$$anonfun$output$1.apply(basicPhysicalOperators.scala:33)
... at 
org.apache.spark.sql.execution.ProjectExec.output(basicPhysicalOperators.scala:33)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.output(WholeStageCodegenExec.scala:289)
at 
org.apache.spark.sql.execution.DeserializeToObject$$anonfun$2.apply(objects.scala:61)
at 
org.apache.spark.sql.execution.DeserializeToObject$$anonfun$2.apply(objects.scala:60)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)

or

Running execution q25-v1.4 iteration: 1, StandardRun=true
java.lang.IllegalStateException: Task -1024 has already locked 
broadcast_755_piece0 for writing
at 
org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:232)
at 
org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1296)

Best,
Ovidiu

[1]
Exception in thread "broadcast-exchange-164" java.lang.OutOfMemoryError: Java 
heap space
at 
org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:539)
at 
org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:803)
at 
org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:105)
at 
org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:816)
at 
org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:812)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:89)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:71)
at 
org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:94)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:71)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:71)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
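
Since the OOM in [1] is thrown while the broadcast hash relation is being built
(LongToUnsafeRowMap.append), one sketch of a workaround, not a confirmed fix, is to
give the driver more heap at launch time (spark.driver.memory only takes effect if
set before the JVM starts, e.g. via --driver-memory on spark-submit) and/or to
disable automatic broadcast joins so the planner falls back to shuffle-based joins:

  // Minimal sketch; the values are assumptions, not verified against this benchmark.
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.sql.autoBroadcastJoinThreshold", "-1") // -1 turns automatic broadcast joins off
    .set("spark.sql.shuffle.partitions", "50")         // assumption: more shuffle parallelism for SF=5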

Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Imran Rashid
+1 (binding)

On Mon, May 23, 2016 at 8:13 AM, Tom Graves 
wrote:

> +1 (binding)
>
> Tom
>
>
> On Sunday, May 22, 2016 7:34 PM, Matei Zaharia 
> wrote:
>
>
> It looks like the discussion thread on this has only had positive replies,
> so I'm going to call a VOTE. The proposal is to remove the maintainer
> process in 
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
> <
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers>
> given that it doesn't seem to have had a huge impact on the project, and it
> can unnecessarily create friction in contributing. We already have +1s from
> Mridul, Tom, Andrew Or and Imran on that thread.
>
> I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Ted Yu
Do you know if more than one PR would be verified on the same machine?

I wonder whether the 'mvn install' runs from two simultaneous PR builds may
conflict.

On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun  wrote:

> Thank you for the feedback. Sure, you're correct: that's the reason why the current
> SparkPullRequestBuilder does not run `lint-java`. :-)
>
> In addition, that's the same reason why contributors are reluctant to run
> `lint-java`, which causes breakage on JDK7 builds.
>
> Such a tedious and time-consuming job should be done by CI without human
> intervention.
>
> By the way, why do you think we need to wait for that? We should not wait
> for any CI; we should continue our own work.
>
> My proposal isn't for making you wait to watch the result. There are two
> use cases I want for us to focus here.
>
> Case 1: When you make a PR to Spark PR queue.
>
> Travis CI will finish before SparkPullRequestBuilder.
> We will run the following in parallel:
>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no Java
> Linter)
>  2. Travis: JDK7 + mvn build + Java Linter
>  3. Travis: JDK8 + mvn build + Java Linter
>  As we know, 1 is the most time-consuming one, since it does lots of
> work (beyond Maven building or lint-java). You don't need to wait longer
> in many cases. Yes, in many cases, not all cases.
>
>
> Case 2: When you prepare a PR on your branch.
>
> If you are at the final commit (maybe already-squashed), just go to
> case 1.
>
> However, we usually make lots of commits locally while preparing our PR.
> Finally, we squash them into one and send a PR to Spark.
> I mean you can use Travis CI while preparing your PRs.
> Again, don't wait for Travis CI. Just push sometime, or at every
> commit, and continue your work.
>
> At the final stage when you finish your coding, squash your commits
> into one,
> and amend your commit title or messages, see the Travis CI.
> Or, you can monitor Travis CI result on status menu bar.
> If it shows green icon, you have nothing to do.
>
>https://docs.travis-ci.com/user/apps/
>
> To sum up, I think we don't need to wait for any CIs. It's like an email.
> `Send and back to work.`
>
> Dongjoon.
>
>
> On Sun, May 22, 2016 at 8:32 PM, Ted Yu  wrote:
>
>> Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.
>>
>> Maybe not everyone is willing to wait that long.
>>
>> On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun 
>> wrote:
>>
>>> Oh, Sure. My bad!
>>>
>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>> - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.
>>>
>>> Thank you, Ted.
>>>
>>> Dongjoon.
>>>
>>> On Sun, May 22, 2016 at 1:29 PM, Ted Yu  wrote:
>>>
 The following line was repeated twice:

 - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.

 Did you intend to cover JDK 8 ?

 Cheers

 On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun 
 wrote:

> Hi, All.
>
> I want to propose the followings.
>
> - Turn on Travis CI for Apache Spark PR queue.
> - Recommend this for contributors, too
>
> Currently, Spark provides a Travis CI configuration file to help
> contributors check Scala/Java style conformance and JDK7/8 compilation
> easily while preparing their pull requests. Please note that it's only
> about static analysis.
>
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
> Scalastyle is included in the step 'mvn install', too.
>
> Yep, if you turn on your Travis CI configuration, you can already see
> the results on your branches before making a PR. I wrote this email to
> prevent more failures proactively, community-wide.
>
> For stability, I have been monitoring it for two weeks. It correctly detects
> failures and recoveries of the JDK7 build and the Java linter on the Spark
> master branch. The only exceptional case I have rarely observed is a `timeout`
> failure due to Maven hangs. But, as we know, that happens in our Jenkins
> SparkPullRequestBuilder, too. I think we can ignore that.
>
> I'm sure that this will save much of the community's effort on
> static errors by preventing them at a very early stage. But there might
> be other reasons not to do this. I'm wondering about your thoughts.
>
> I can create an Apache INFRA JIRA issue for this if there is some
> consensus.
>
> Warmly,
> Dongjoon.
>


>>>
>>
>


Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Tom Graves
+1 (binding)
Tom 

On Sunday, May 22, 2016 7:34 PM, Matei Zaharia  
wrote:
 

 It looks like the discussion thread on this has only had positive replies, so 
I'm going to call a VOTE. The proposal is to remove the maintainer process in 
https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
 

 given that it doesn't seem to have had a huge impact on the project, and it 
can unnecessarily create friction in contributing. We already have +1s from 
Mridul, Tom, Andrew Or and Imran on that thread.

I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


   

Re: I will fix SPARK-15477

2016-05-23 Thread Sean Owen
You don't need it assigned, but I do not know if this is a valid change. If
in doubt, try to get confirmation that it is correct from someone familiar
with the implications.

On Mon, May 23, 2016, 04:27 马骉  wrote:

> Hi,
> I will fix SPARK-15477.
> Can you please assign the contributor permission to me?
> I've been using Spark for two years, and I'm familiar with the Spark source
> code and coding style.
>
> Warmest regards~
> From: Biao Ma
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Hyukjin Kwon
+1 - I wouldn't be bothered if a build becomes longer if I can write
cleaner code without running it manually.

I have just looked through the related PRs and JIRAs and it looks generally
okay and reasonable to me.

2016-05-23 18:54 GMT+09:00 Steve Loughran :

>
> On 23 May 2016, at 05:21, Dongjoon Hyun  wrote:
>
> Thank you for the feedback. Sure, you're correct: that's the reason why the current
> SparkPullRequestBuilder does not run `lint-java`. :-)
>
> In addition, that's the same reason why contributors are reluctant to run
> `lint-java`, which causes breakage on JDK7 builds.
>
> Such a tedious and time-consuming job should be done by CI without human
> intervention.
>
> By the way, why do you think we need to wait for that? We should not wait
> for any CI; we should continue our own work.
>
>
>
> +1
>
> Any time you spend waiting for tests to complete is time you could be
> doing useful things. I've had VMs running jenkins watching git branches do
> this in the past: every time I push
>
>
>
> My proposal isn't for making you wait to watch the result. There are two
> use cases I want for us to focus here.
>
> Case 1: When you make a PR to Spark PR queue.
>
> Travis CI will finish before SparkPullRequestBuilder.
> We will run the following in parallel:
>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no Java
> Linter)
>  2. Travis: JDK7 + mvn build + Java Linter
>  3. Travis: JDK8 + mvn build + Java Linter
>  As we know, 1 is the most time-consuming one, since it does lots of
> work (beyond Maven building or lint-java). You don't need to wait longer
> in many cases. Yes, in many cases, not all cases.
>
>
> Case 2: When you prepare a PR on your branch.
>
> If you are at the final commit (maybe already-squashed), just go to
> case 1.
>
> However, we usually make lots of commits locally while preparing our PR.
> Finally, we squash them into one and send a PR to Spark.
> I mean you can use Travis CI while preparing your PRs.
> Again, don't wait for Travis CI. Just push sometime, or at every
> commit, and continue your work.
>
> At the final stage when you finish your coding, squash your commits
> into one,
> and amend your commit title or messages, see the Travis CI.
> Or, you can monitor Travis CI result on status menu bar.
> If it shows green icon, you have nothing to do.
>
>https://docs.travis-ci.com/user/apps/
>
> To sum up, I think we don't need to wait for any CIs. It's like an email.
> `Send and back to work.`
>
>
>
> I'd add another, which is "do a build and test of this patch while I get
> on with something else": that is, things which aren't ready for review, just
> the work you've done in the past hour or two which you'd like tested out.
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Steve Loughran

On 23 May 2016, at 05:21, Dongjoon Hyun 
> wrote:

Thank you for the feedback. Sure, you're correct: that's the reason why the current
SparkPullRequestBuilder does not run `lint-java`. :-)

In addition, that's the same reason why contributors are reluctant to run
`lint-java`, which causes breakage on JDK7 builds.

Such a tedious and time-consuming job should be done by CI without human
intervention.

By the way, why do you think we need to wait for that? We should not wait for
any CI; we should continue our own work.


+1

Any time you spend waiting for tests to complete is time you could be doing 
useful things. I've had VMs running jenkins watching git branches do this in 
the past: every time I push



My proposal isn't for making you wait to watch the result. There are two use 
cases I want for us to focus here.

Case 1: When you make a PR to Spark PR queue.

Travis CI will finish before SparkPullRequestBuilder.
We will run the following in parallel:
 1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no Java Linter)
 2. Travis: JDK7 + mvn build + Java Linter
 3. Travis: JDK8 + mvn build + Java Linter
 As we know, 1 is the most time-consuming one, since it does lots of work
(beyond Maven building or lint-java). You don't need to wait longer in many
cases. Yes, in many cases, not all cases.


Case 2: When you prepare a PR on your branch.

If you are at the final commit (maybe already-squashed), just go to case 1.

However, we usually make lots of commits locally while preparing our PR.
Finally, we squash them into one and send a PR to Spark.
I mean you can use Travis CI while preparing your PRs.
Again, don't wait for Travis CI. Just push sometime, or at every commit,
and continue your work.

At the final stage when you finish your coding, squash your commits into 
one,
and amend your commit title or messages, see the Travis CI.
Or, you can monitor Travis CI result on status menu bar.
If it shows green icon, you have nothing to do.

   https://docs.travis-ci.com/user/apps/

To sum up, I think we don't need to wait for any CIs. It's like an email. `Send 
and back to work.`



I'd add another, which is "do a build and test of this patch while I get on
with something else": that is, things which aren't ready for review, just the
work you've done in the past hour or two which you'd like tested out.


Re: Building spark master failed

2016-05-23 Thread Ovidiu-Cristian MARCU
You’re right, I thought the latest would only compile against Java 8.
Thanks
 
> On 23 May 2016, at 11:35, Dongjoon Hyun  wrote:
> 
> Hi, 
> 
> That is not the latest. 
> 
> The bug was fixed 5 days ago.
> 
> Regards,
> Dongjoon.
> 
> 
> On Mon, May 23, 2016 at 2:16 AM, Ovidiu-Cristian MARCU 
> > 
> wrote:
> Hi
> 
> I have the following issue when trying to build the latest spark source code 
> on master:
> 
> /spark/common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java:147:
>  error: cannot find symbol
> [error]   if (process != null && process.isAlive()) {
> [error] ^
> [error]   symbol:   method isAlive()
> [error]   location: variable process of type Process
> [error] 1 error
> [error] Compile failed at May 23, 2016 11:13:58 AM [1.319s]
> 
> related to [INFO] Spark Project Networking ... 
> FAILURE [  1.495 s]
> 
> Am I missing some fix?
> 
> Thanks
> 
> Best,
> Ovidiu
> 



Re: Building spark master failed

2016-05-23 Thread Dongjoon Hyun
Hi,

That is not the latest.

The bug was fixed 5 days ago.

Regards,
Dongjoon.


On Mon, May 23, 2016 at 2:16 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Hi
>
> I have the following issue when trying to build the latest spark source
> code on master:
>
> /spark/common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java:147:
> error: cannot find symbol
> [error]   if (process != null && process.isAlive()) {
> [error] ^
> [error]   symbol:   method isAlive()
> [error]   location: variable process of type Process
> [error] 1 error
> [error] Compile failed at May 23, 2016 11:13:58 AM [1.319s]
>
> related to [INFO] Spark Project Networking ...
> FAILURE [  1.495 s]
>
> Am I missing some fix?
>
> Thanks
>
> Best,
> Ovidiu
>


I will fix SPARK-15477

2016-05-23 Thread 马骉 (Biao Ma)
Hi,
I will fix SPARK-15477.
Can you please assign the contributor permission to me?
I've been using Spark for two years, and I'm familiar with the Spark source code
and coding style.


Warmest regards~
From: Biao Ma

Building spark master failed

2016-05-23 Thread Ovidiu-Cristian MARCU
Hi

I have the following issue when trying to build the latest spark source code on 
master:

/spark/common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java:147:
 error: cannot find symbol
[error]   if (process != null && process.isAlive()) {
[error] ^
[error]   symbol:   method isAlive()
[error]   location: variable process of type Process
[error] 1 error
[error] Compile failed at May 23, 2016 11:13:58 AM [1.319s]

related to [INFO] Spark Project Networking ... FAILURE 
[  1.495 s]

Am I missing some fix?

Thanks

Best,
Ovidiu
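
For context on the error itself: Process.isAlive() was only added in Java 8, which
is why a JDK7 build cannot resolve the symbol. A JDK7-compatible liveness check
(a sketch of the general shape of such a fix, not the actual patch) relies on
exitValue() throwing while the process is still running:

  def isProcessAlive(process: Process): Boolean = {
    try {
      process.exitValue()  // returns only once the process has terminated
      false
    } catch {
      case _: IllegalThreadStateException => true  // still running
    }
  }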

Re: spark on kubernetes

2016-05-23 Thread Gurvinder Singh
Hi Reynold,

So if that's OK with you, can I go ahead and create a JIRA for this? This
feature seems to be missing currently and can benefit not just
Kubernetes users but Spark standalone mode users in general.

- Gurvinder
On 05/22/2016 12:49 PM, Gurvinder Singh wrote:
> On 05/22/2016 10:23 AM, Sun Rui wrote:
>> If it is possible to rewrite URL in outbound responses in Knox or other 
>> reverse proxy, would that solve your issue?
> Any process which can keep track of worker and application driver IP
> addresses and route traffic to them will work. Since the Spark master
> does exactly this (all workers and applications have to register with
> the master), I propose the master as the place to add such
> functionality.
> 
> I am not aware of Knox's capabilities, but Nginx or any other normal
> reverse proxy will not be able to do this on its own due to the dynamic
> nature of application drivers and, to some extent, workers too.
> 
> - Gurvinder
>>> On May 22, 2016, at 14:55, Gurvinder Singh  
>>> wrote:
>>>
>>> On 05/22/2016 08:32 AM, Reynold Xin wrote:
 Kubernetes itself already has facilities for http proxy, doesn't it?

>>> Yeah, Kubernetes has an ingress controller which can act as the L7 load
>>> balancer and route traffic to the Spark UI in this case. But I am referring
>>> to the links present in the UI to the worker and application UIs. I replied
>>> in detail to Sun Rui's mail, where I gave an example of a possible scenario.
>>>
>>> - Gurvinder

 On Sat, May 21, 2016 at 9:30 AM, Gurvinder Singh
 > wrote:

Hi,

I am currently working on deploying Spark on Kubernetes (K8s) and it is
working fine. I am running Spark in standalone mode and checkpointing
the state to a shared system. So if the master fails, K8s restarts it, it
recovers the earlier state from the checkpoint, and things just work. I
have an issue with accessing the worker and application UI links from the
Spark master Web UI. In brief, the Kubernetes service model allows me to
expose the master service to the internet, but accessing the
application/worker UIs is not possible, as I would then have to expose them
individually too, and given that I can have multiple applications it becomes
hard to manage.

One solution could be for the master to act as a reverse proxy for
information/state/logs from applications/workers. Since it learns their
endpoints when applications/workers register with the master, when a user
initiates a request to access the information, the master can proxy the
request to the corresponding endpoint.

So I am wondering if someone has already done work in this direction;
it would be great to know. If not, would the community be interested
in such a feature? If yes, how and where should I get
started? It would be helpful for me to have some guidance to start
working on this.

Kind Regards,
Gurvinder
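
As a rough illustration of the reverse-proxy idea above (hypothetical names, not
actual Spark code): the master already learns each worker's and driver's UI address
at registration time, so a proxy endpoint only has to look the address up and
forward the request, e.g.:

  import java.net.{HttpURLConnection, URL}
  import scala.io.Source

  // Filled in as workers/applications register with the master (assumption).
  val uiEndpoints = Map(
    "worker-20160523-1" -> "http://10.0.0.12:8081",
    "app-20160523-0001" -> "http://10.0.0.13:4040")

  // A request to /proxy/<id>/<path> on the master UI would be forwarded like this.
  def proxyGet(id: String, path: String): String = {
    val target = new URL(s"${uiEndpoints(id)}/$path")
    val conn = target.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    try Source.fromInputStream(conn.getInputStream).mkString
    finally conn.disconnect()
  }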

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Nick Pentreath
+1 (binding)
On Mon, 23 May 2016 at 04:19, Matei Zaharia  wrote:

> Correction, let's run this for 72 hours, so until 9 PM EST May 25th.
>
> > On May 22, 2016, at 8:34 PM, Matei Zaharia 
> wrote:
> >
> > It looks like the discussion thread on this has only had positive
> replies, so I'm going to call a VOTE. The proposal is to remove the
> maintainer process in
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
> <
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers>
> given that it doesn't seem to have had a huge impact on the project, and it
> can unnecessarily create friction in contributing. We already have +1s from
> Mridul, Tom, Andrew Or and Imran on that thread.
> >
> > I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
> >
> > Matei
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>