Re: Welcoming Yanbo Liang as a committer

2016-06-03 Thread Dongjoon Hyun
Wow, Congratulations, Yanbo!

Dongjoon.

On Fri, Jun 3, 2016 at 8:22 PM, Xiao Li  wrote:

> Congratulations, Yanbo!
>
> 2016-06-03 19:54 GMT-07:00 Nan Zhu :
>
>> Congratulations !
>>
>> --
>> Nan Zhu
>>
>> On June 3, 2016 at 10:50:33 PM, Ted Yu (yuzhih...@gmail.com) wrote:
>>
>> Congratulations, Yanbo.
>>
>> On Fri, Jun 3, 2016 at 7:48 PM, Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been
>>> a super active contributor in many areas of MLlib. Please join me in
>>> welcoming Yanbo!
>>>
>>> Matei
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-07 Thread Dongjoon Hyun
Or, I hope one of the committers commits both mine (11567) and that soon.
Since it's related to the build setting files, the Jenkins test took over 2 hours. :(

Dongjoon.
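
(For anyone hitting the same failure locally before the fix is merged, the
temporary workaround Jacek describes below is simply to move the file to
where the plugin currently expects it, roughly:)

```
# Temporary local workaround (per Jacek's note below); revert once the pom.xml fix lands.
git mv scalastyle-config.xml dev/
./build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -DskipTests clean install
```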

On Mon, Mar 7, 2016 at 11:48 PM, Dongjoon Hyun <dongj...@apache.org> wrote:

> Ur, may I include that, too?
>
> Dongjoon.
>
> On Mon, Mar 7, 2016 at 11:46 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Okey...it's building now
>> properly...https://github.com/apache/spark/pull/11567 + git mv
>> scalastyle-config.xml dev/
>>
>> How to fix it in the repo? Should I send a pull request to...pull
>> request #11567? Guide me or fix it yourself...somehow :-)
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Tue, Mar 8, 2016 at 8:42 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>> > Hi,
>> >
>> > Nope. It's not. It's at
>> >
>> https://github.com/apache/spark/commit/0eea12a3d956b54bbbd73d21b296868852a04494#diff-600376dffeb79835ede4a0b285078036L2249
>> .
>> > I've got that and testing...
>> >
>> > Pozdrawiam,
>> > Jacek Laskowski
>> > 
>> > https://medium.com/@jaceklaskowski/
>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> > Follow me at https://twitter.com/jaceklaskowski
>> >
>> >
>> > On Tue, Mar 8, 2016 at 8:41 AM, Shixiong(Ryan) Zhu
>> > <shixi...@databricks.com> wrote:
>> >> There is a fix: https://github.com/apache/spark/pull/11567
>> >>
>> >> On Mon, Mar 7, 2016 at 11:39 PM, Reynold Xin <r...@databricks.com>
>> wrote:
>> >>>
>> >>> +Sean, who was playing with this.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Mar 7, 2016 at 11:38 PM, Jacek Laskowski <ja...@japila.pl>
>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> Got the BUILD FAILURE. Anyone looking into it?
>> >>>>
>> >>>> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
>> >>>> -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean
>> >>>> install
>> >>>> ...
>> >>>> [INFO]
>> >>>>
>> 
>> >>>> [INFO] BUILD FAILURE
>> >>>> [INFO]
>> >>>>
>> 
>> >>>> [INFO] Total time: 2.837 s
>> >>>> [INFO] Finished at: 2016-03-08T08:19:36+01:00
>> >>>> [INFO] Final Memory: 50M/581M
>> >>>> [INFO]
>> >>>>
>> 
>> >>>> [ERROR] Failed to execute goal
>> >>>> org.scalastyle:scalastyle-maven-plugin:0.8.0:check (default) on
>> >>>> project spark-parent_2.11: Failed during scalastyle execution: Unable
>> >>>> to find configuration file at location dev/scalastyle-config.xml ->
>> >>>> [Help 1]
>> >>>> [ERROR]
>> >>>> [ERROR] To see the full stack trace of the errors, re-run Maven with
>> >>>> the -e switch.
>> >>>> [ERROR] Re-run Maven using the -X switch to enable full debug
>> logging.
>> >>>> [ERROR]
>> >>>> [ERROR] For more information about the errors and possible solutions,
>> >>>> please read the following articles:
>> >>>> [ERROR] [Help 1]
>> >>>>
>> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>> >>>>
>> >>>> Pozdrawiam,
>> >>>> Jacek Laskowski
>> >>>> 
>> >>>> https://medium.com/@jaceklaskowski/
>> >>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> >>>> Follow me at https://twitter.com/jaceklaskowski
>> >>>>
>> >>>> -
>> >>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >>>> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>>>
>> >>>
>> >>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-08 Thread Dongjoon Hyun
Hi, I updated PR https://github.com/apache/spark/pull/11567.

But, `lint-java` fails if that file is in the dev folder. (Jenkins fails,
too.)

So, inevitably, I changed pom.xml instead.

Dongjoon.
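
(For reference, the plugin involved is scalastyle-maven-plugin 0.8.0, whose
configLocation parameter controls where it looks for the rules file. The
snippet below is an illustrative sketch of that kind of pom.xml
configuration, not the exact diff in the PR.)

```xml
<!-- Illustrative sketch only: scalastyle-maven-plugin with an explicit rules file location. -->
<plugin>
  <groupId>org.scalastyle</groupId>
  <artifactId>scalastyle-maven-plugin</artifactId>
  <version>0.8.0</version>
  <configuration>
    <!-- Point this at wherever scalastyle-config.xml actually lives. -->
    <configLocation>scalastyle-config.xml</configLocation>
    <failOnViolation>true</failOnViolation>
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>check</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```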


On Mon, Mar 7, 2016 at 11:40 PM, Jacek Laskowski  wrote:

> Hi,
>
> At first glance it appears the commit *yesterday* (Warsaw time) broke
> the build :(
>
>
> https://github.com/apache/spark/commit/0eea12a3d956b54bbbd73d21b296868852a04494
>
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Mar 8, 2016 at 8:38 AM, Jacek Laskowski  wrote:
> > Hi,
> >
> > Got the BUILD FAILURE. Anyone looking into it?
> >
> > ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
> > -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean
> > install
> > ...
> > [INFO]
> 
> > [INFO] BUILD FAILURE
> > [INFO]
> 
> > [INFO] Total time: 2.837 s
> > [INFO] Finished at: 2016-03-08T08:19:36+01:00
> > [INFO] Final Memory: 50M/581M
> > [INFO]
> 
> > [ERROR] Failed to execute goal
> > org.scalastyle:scalastyle-maven-plugin:0.8.0:check (default) on
> > project spark-parent_2.11: Failed during scalastyle execution: Unable
> > to find configuration file at location dev/scalastyle-config.xml ->
> > [Help 1]
> > [ERROR]
> > [ERROR] To see the full stack trace of the errors, re-run Maven with
> > the -e switch.
> > [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> > [ERROR]
> > [ERROR] For more information about the errors and possible solutions,
> > please read the following articles:
> > [ERROR] [Help 1]
> > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-07 Thread Dongjoon Hyun
Ur, may I include that, too?

Dongjoon.

On Mon, Mar 7, 2016 at 11:46 PM, Jacek Laskowski  wrote:

> Okey...it's building now
> properly...https://github.com/apache/spark/pull/11567 + git mv
> scalastyle-config.xml dev/
>
> How to fix it in the repo? Should I send a pull request to...pull
> request #11567? Guide me or fix it yourself...somehow :-)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Mar 8, 2016 at 8:42 AM, Jacek Laskowski  wrote:
> > Hi,
> >
> > Nope. It's not. It's at
> >
> https://github.com/apache/spark/commit/0eea12a3d956b54bbbd73d21b296868852a04494#diff-600376dffeb79835ede4a0b285078036L2249
> .
> > I've got that and testing...
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Tue, Mar 8, 2016 at 8:41 AM, Shixiong(Ryan) Zhu
> >  wrote:
> >> There is a fix: https://github.com/apache/spark/pull/11567
> >>
> >> On Mon, Mar 7, 2016 at 11:39 PM, Reynold Xin 
> wrote:
> >>>
> >>> +Sean, who was playing with this.
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Mar 7, 2016 at 11:38 PM, Jacek Laskowski 
> wrote:
> 
>  Hi,
> 
>  Got the BUILD FAILURE. Anyone looking into it?
> 
>  ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
>  -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean
>  install
>  ...
>  [INFO]
> 
> 
>  [INFO] BUILD FAILURE
>  [INFO]
> 
> 
>  [INFO] Total time: 2.837 s
>  [INFO] Finished at: 2016-03-08T08:19:36+01:00
>  [INFO] Final Memory: 50M/581M
>  [INFO]
> 
> 
>  [ERROR] Failed to execute goal
>  org.scalastyle:scalastyle-maven-plugin:0.8.0:check (default) on
>  project spark-parent_2.11: Failed during scalastyle execution: Unable
>  to find configuration file at location dev/scalastyle-config.xml ->
>  [Help 1]
>  [ERROR]
>  [ERROR] To see the full stack trace of the errors, re-run Maven with
>  the -e switch.
>  [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>  [ERROR]
>  [ERROR] For more information about the errors and possible solutions,
>  please read the following articles:
>  [ERROR] [Help 1]
> 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> 
>  Pozdrawiam,
>  Jacek Laskowski
>  
>  https://medium.com/@jaceklaskowski/
>  Mastering Apache Spark http://bit.ly/mastering-apache-spark
>  Follow me at https://twitter.com/jaceklaskowski
> 
>  -
>  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>  For additional commands, e-mail: dev-h...@spark.apache.org
> 
> >>>
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
Thank you, Steve and Hyukjin.

And, don't worry, Ted.

Travis launches new VMs for every PR.

The Apache Spark repository uses the following settings:

VM: Google Compute Engine
OS: Ubuntu 14.04.3 LTS Server Edition 64bit
CPU: ~2 CORE
RAM: 7.5GB

FYI, you can find more information about this here.

https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments

Dongjoon.



On Mon, May 23, 2016 at 6:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Do you know if more than one PR would be verified on the same machine ?
>
> I wonder whether the 'mvn install' from two simultaneous PR builds may
> have conflict.
>
> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun <dongj...@apache.org>
> wrote:
>
>> Thank you for feedback. Sure, correctly, that's the reason why the
>> current SparkPullRequestBuilder do not run `lint-java`. :-)
>>
>> In addition, that's the same reason why contributors are reluctant to run
>> `lint-java` and causes breaking on JDK7 builds.
>>
>> Such a tedious and time-consuming job should be done by CI without human
>> interventions.
>>
>> By the way, why do you think we need to wait for that? We should not wait
>> for any CIs, we should continue our own work.
>>
>> My proposal isn't for making you wait to watch the result. There are two
>> use cases I want for us to focus here.
>>
>> Case 1: When you make a PR to Spark PR queue.
>>
>> Travis CI will finish before SparkPullRequestBuilder.
>> We will run the followings in parallel mode.
>>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no Java
>> Linter)
>>  2. Travis: JDK7 + mvn build + Java Linter
>>  3. Travis: JDK8 + mvn build + Java Linter
>>  As we know, 1 is the longest time-consuming one which have lots of
>> works (except maven building or lint-  java). You don't need to wait more
>> in many cases. Yes, in many cases, not all the cases.
>>
>>
>> Case 2: When you prepare a PR on your branch.
>>
>> If you are at the final commit (maybe already-squashed), just go to
>> case 1.
>>
>> However, usually, we makes lots of commits locally while making
>> preparing our PR.
>> And, finally we squashed them into one and send a PR to Spark.
>> I mean you can use Travis CI during preparing your PRs.
>> Again, don't wait for Travis CI. Just push it sometime or at every
>> commit, and continue your work.
>>
>> At the final stage when you finish your coding, squash your commits
>> into one,
>> and amend your commit title or messages, see the Travis CI.
>> Or, you can monitor Travis CI result on status menu bar.
>> If it shows green icon, you have nothing to do.
>>
>>https://docs.travis-ci.com/user/apps/
>>
>> To sum up, I think we don't need to wait for any CIs. It's like an email.
>> `Send and back to work.`
>>
>> Dongjoon.
>>
>>
>> On Sun, May 22, 2016 at 8:32 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.
>>>
>>> Maybe not everyone is willing to wait that long.
>>>
>>> On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun <dongj...@apache.org>
>>> wrote:
>>>
>>>> Oh, Sure. My bad!
>>>>
>>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>> - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.
>>>>
>>>> Thank you, Ted.
>>>>
>>>> Dongjoon.
>>>>
>>>> On Sun, May 22, 2016 at 1:29 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> The following line was repeated twice:
>>>>>
>>>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>>>
>>>>> Did you intend to cover JDK 8 ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun <dongj...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> I want to propose the followings.
>>>>>>
>>>>>> - Turn on Travis CI for Apache Spark PR queue.
>>>>>> - Recommend this for contributors, too
>>>>>>
>>>>>> Currently, Spark provides Travis CI configuration file to help
>>>>>> contributors check Scala/Java style conformance and JDK7/8 compilation
>>>>>> easily 

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
Thank you, Shane!

I really hope that SparkPullRequestBuilder handles them if possible.

Dongjoon.

On Mon, May 23, 2016 at 1:24 PM, Dongjoon Hyun <dongj...@apache.org> wrote:

> Thank you for your opinion!
>
> Sure. I know that history and totally agree with all your concerns.
> I indeed has hesitated about sending this kind of suggestion for a while.
>
> If Travis CI cannot handle those simple jobs at this time again,
> we must turn off from Spark PR queue.
> We can see the result quickly in one or two days.
> To turn on/off, Spark have nothing to do. INFRA team will do that.
>
> In fact, the goal is not about using another CI (like Travis), it is about
> preventing the followings.
>
> 1. JDK7 compilation errors. (Recently, 2 days ago and 5 days ago)
> 2. Java static errors. (Not critical but more frequently.)
> 3. Maven installation errors. (A month ago, it's reported in this mailing
> list.)
>
> Scala 2.10 compilation errors are fixed nearly instantly. But, 1~3 were
> not.
> If SparkPullRequestBuilder can do the above 1~3, that's the best for us.
> Do you think it is possible in some ways?
>
> By the way, as of today, Spark has 724 Java files and 96762 lines (without
> comment/blank).
> It's about 1/3 of Scala code. It's not small.
> ---
> Language    files    blank    comment      code
> ---
> Scala        2368    63578     124904    322518
> Java          724    18569      23445     96762
>
> Dongjoon.
>
>
>
> On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> We did turn on travis a few years ago, but ended up turning it off
>> because it was failing (I believe because of insufficient resources) which
>> was confusing for developers.  I wouldn't be opposed to turning it on if it
>> provides more/faster signal, but its not obvious to me that it would.  In
>> particular, do we know that given the rate PRs are created if we will hit
>> rate limits?
>>
>> Really my main feedback is, if the java linter is important we should
>> probably have it as part of the canonical build process.  I worry about
>> having more than one set of CI infrastructure to maintain.
>>
>> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun <dongj...@apache.org>
>> wrote:
>>
>>> Thank you, Steve and Hyukjin.
>>>
>>> And, don't worry, Ted.
>>>
>>> Travis launches new VMs for every PR.
>>>
>>> Apache Spark repository uses the following setting.
>>>
>>> VM: Google Compute Engine
>>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
>>> CPU: ~2 CORE
>>> RAM: 7.5GB
>>>
>>> FYI, you can find more information about this here.
>>>
>>>
>>> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> On Mon, May 23, 2016 at 6:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Do you know if more than one PR would be verified on the same machine ?
>>>>
>>>> I wonder whether the 'mvn install' from two simultaneous PR builds may
>>>> have conflict.
>>>>
>>>> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun <dongj...@apache.org>
>>>> wrote:
>>>>
>>>>> Thank you for feedback. Sure, correctly, that's the reason why the
>>>>> current SparkPullRequestBuilder do not run `lint-java`. :-)
>>>>>
>>>>> In addition, that's the same reason why contributors are reluctant to
>>>>> run `lint-java` and causes breaking on JDK7 builds.
>>>>>
>>>>> Such a tedious and time-consuming job should be done by CI without
>>>>> human interventions.
>>>>>
>>>>> By the way, why do you think we need to wait for that? We should not
>>>>> wait for any CIs, we should continue our own work.
>>>>>
>>>>> My proposal isn't for making you wait to watch the result. There are
>>>>> two use cases I want for us to focus here.
>>>>>
>>>>> Case 1: When you make a PR to Spark PR queue.
>>>>>
>>>>> Travis CI will finish before SparkPullRequestBuilder.
>>>>> We will run the followings in parallel mode.
>>>>>  1. Current SparkPullRequestBuilde

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
Thank you for your opinion!

Sure. I know that history and totally agree with all your concerns.
I had indeed hesitated about sending this kind of suggestion for a while.

If Travis CI cannot handle those simple jobs this time, again,
we must turn it off for the Spark PR queue.
We would see the result quickly, in one or two days.
To turn it on/off, Spark has nothing to do; the INFRA team will do that.

In fact, the goal is not about using another CI (like Travis); it is about
preventing the following:

1. JDK7 compilation errors. (Recently: 2 days ago and 5 days ago.)
2. Java static analysis errors. (Not critical, but more frequent.)
3. Maven installation errors. (Reported on this mailing list a month ago.)

Scala 2.10 compilation errors are fixed nearly instantly, but 1-3 were not.
If SparkPullRequestBuilder can do 1-3 above, that's the best for us.
Do you think it is possible in some way?

By the way, as of today, Spark has 724 Java files and 96762 lines (without
comments/blanks).
That's about 1/3 of the Scala code. It's not small.
---
Language    files    blank    comment      code
---
Scala        2368    63578     124904    322518
Java          724    18569      23445     96762

Dongjoon.



On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> We did turn on travis a few years ago, but ended up turning it off because
> it was failing (I believe because of insufficient resources) which was
> confusing for developers.  I wouldn't be opposed to turning it on if it
> provides more/faster signal, but its not obvious to me that it would.  In
> particular, do we know that given the rate PRs are created if we will hit
> rate limits?
>
> Really my main feedback is, if the java linter is important we should
> probably have it as part of the canonical build process.  I worry about
> having more than one set of CI infrastructure to maintain.
>
> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun <dongj...@apache.org>
> wrote:
>
>> Thank you, Steve and Hyukjin.
>>
>> And, don't worry, Ted.
>>
>> Travis launches new VMs for every PR.
>>
>> Apache Spark repository uses the following setting.
>>
>> VM: Google Compute Engine
>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
>> CPU: ~2 CORE
>> RAM: 7.5GB
>>
>> FYI, you can find more information about this here.
>>
>>
>> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>>
>> Dongjoon.
>>
>>
>>
>> On Mon, May 23, 2016 at 6:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Do you know if more than one PR would be verified on the same machine ?
>>>
>>> I wonder whether the 'mvn install' from two simultaneous PR builds may
>>> have conflict.
>>>
>>> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun <dongj...@apache.org>
>>> wrote:
>>>
>>>> Thank you for feedback. Sure, correctly, that's the reason why the
>>>> current SparkPullRequestBuilder do not run `lint-java`. :-)
>>>>
>>>> In addition, that's the same reason why contributors are reluctant to
>>>> run `lint-java` and causes breaking on JDK7 builds.
>>>>
>>>> Such a tedious and time-consuming job should be done by CI without
>>>> human interventions.
>>>>
>>>> By the way, why do you think we need to wait for that? We should not
>>>> wait for any CIs, we should continue our own work.
>>>>
>>>> My proposal isn't for making you wait to watch the result. There are
>>>> two use cases I want for us to focus here.
>>>>
>>>> Case 1: When you make a PR to Spark PR queue.
>>>>
>>>> Travis CI will finish before SparkPullRequestBuilder.
>>>> We will run the followings in parallel mode.
>>>>  1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no
>>>> Java Linter)
>>>>  2. Travis: JDK7 + mvn build + Java Linter
>>>>  3. Travis: JDK8 + mvn build + Java Linter
>>>>  As we know, 1 is the longest time-consuming one which have lots of
>>>> works (except maven building or lint-  java). You don't need to wait more
>>>> in many cases. Yes, in many cases, not all the cases.
>>>>
>>>>
>>>> Case 2: When you prepare a PR on your branch.
>>>>
>>>> If you are at the final commit (maybe already-squashed), just go 

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
I want to clarify something here.

Travis CI is free for open source projects, and there is only one
management point for the Spark community: `.travis.yml`.

It's not something like a physical Jenkins cluster farm. It's just a cloud
service, like GitHub.

PS.
I'm also not an employee of Travis (or GitHub). :-)
If Spark uses Travis CI for free, they might dislike me for the heavy traffic.


On Mon, May 23, 2016 at 1:26 PM, Dongjoon Hyun <dongj...@apache.org> wrote:

> Thank you, Shane!
>
> I really hope that SparkPullRequestBuilder handle them if possible.
>
> Dongjoon.
>
> On Mon, May 23, 2016 at 1:24 PM, Dongjoon Hyun <dongj...@apache.org>
> wrote:
>
>> Thank you for your opinion!
>>
>> Sure. I know that history and totally agree with all your concerns.
>> I indeed has hesitated about sending this kind of suggestion for a while.
>>
>> If Travis CI cannot handle those simple jobs at this time again,
>> we must turn off from Spark PR queue.
>> We can see the result quickly in one or two days.
>> To turn on/off, Spark have nothing to do. INFRA team will do that.
>>
>> In fact, the goal is not about using another CI (like Travis), it is
>> about preventing the followings.
>>
>> 1. JDK7 compilation errors. (Recently, 2 days ago and 5 days ago)
>> 2. Java static errors. (Not critical but more frequently.)
>> 3. Maven installation errors. (A month ago, it's reported in this mailing
>> list.)
>>
>> Scala 2.10 compilation errors are fixed nearly instantly. But, 1~3 were
>> not.
>> If SparkPullRequestBuilder can do the above 1~3, that's the best for us.
>> Do you think it is possible in some ways?
>>
>> By the way, as of today, Spark has 724 Java files and 96762 lines
>> (without comment/blank).
>> It's about 1/3 of Scala code. It's not small.
>> ---
>> Language    files    blank    comment      code
>> ---
>> Scala        2368    63578     124904    322518
>> Java          724    18569      23445     96762
>>
>> Dongjoon.
>>
>>
>>
>> On Mon, May 23, 2016 at 12:20 PM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> We did turn on travis a few years ago, but ended up turning it off
>>> because it was failing (I believe because of insufficient resources) which
>>> was confusing for developers.  I wouldn't be opposed to turning it on if it
>>> provides more/faster signal, but its not obvious to me that it would.  In
>>> particular, do we know that given the rate PRs are created if we will hit
>>> rate limits?
>>>
>>> Really my main feedback is, if the java linter is important we should
>>> probably have it as part of the canonical build process.  I worry about
>>> having more than one set of CI infrastructure to maintain.
>>>
>>> On Mon, May 23, 2016 at 9:43 AM, Dongjoon Hyun <dongj...@apache.org>
>>> wrote:
>>>
>>>> Thank you, Steve and Hyukjin.
>>>>
>>>> And, don't worry, Ted.
>>>>
>>>> Travis launches new VMs for every PR.
>>>>
>>>> Apache Spark repository uses the following setting.
>>>>
>>>> VM: Google Compute Engine
>>>> OS: Ubuntu 14.04.3 LTS Server Edition 64bit
>>>> CPU: ~2 CORE
>>>> RAM: 7.5GB
>>>>
>>>> FYI, you can find more information about this here.
>>>>
>>>>
>>>> https://docs.travis-ci.com/user/ci-environment/#Virtualization-environments
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Mon, May 23, 2016 at 6:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Do you know if more than one PR would be verified on the same machine ?
>>>>>
>>>>> I wonder whether the 'mvn install' from two simultaneous PR builds may
>>>>> have conflict.
>>>>>
>>>>> On Sun, May 22, 2016 at 9:21 PM, Dongjoon Hyun <dongj...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Thank you for feedback. Sure, correctly, that's the reason why the
>>>>>> current SparkPullRequestBuilder do not run `lint-java`. :-)
>>>>>>
>>>>>> In addition, that's the same reason why contributors are reluctant to
>>>>>> run `lint-ja

Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-24 Thread Dongjoon Hyun
Yep. Let's hold on. :)

On Tue, May 24, 2016 at 3:45 PM, shane knapp  wrote:

> > Sure, could you give me the permission for Spark Jira?
> >
> > Although we haven't decided yet, I can add Travis related section
> > (summarizing current configurations and expected VM HW, etc).
> >
> i can't give you permissions -- that has to be (most likely) through
> someone @ databricks, like michael.
>
> let's hold off on adding a section until we actually decide that this
> is critical and something that cannot be done currently w/jenkins.
>
> > It's just a Wiki, you can delete the Travis Section anytime.
> >
> ayup.
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-23 Thread Dongjoon Hyun
Thank you, Sean!

On Mon, May 23, 2016 at 2:09 PM, Sean Owen  wrote:

> No, because then none of the Java 8 support can build. Marcelo has a JIRA
> for handling that the right way with bootstrap class path config.
>
> Ideally it can be rolled into Jenkins though there are possibly historical
> reasons it was not enabled before. Best to fix those if possible but if not
> I'd rather have some automated checking than none. Checking lint is
> reasonably important.
>
> On Mon, May 23, 2016, 15:58 Ted Yu  wrote:
>
>> For #1 below, currently Jenkins uses Java 8:
>>
>> JAVA_HOME=/usr/java/jdk1.8.0_60
>>
>>
>> How about switching to Java 7 ?
>>
>>
>> My two cents.
>>
>>
>>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Dongjoon Hyun
Thank you for the feedback. Sure, exactly: that's the reason why the current
SparkPullRequestBuilder does not run `lint-java`. :-)

In addition, that's the same reason why contributors are reluctant to run
`lint-java`, which causes breakage on JDK7 builds.

Such a tedious and time-consuming job should be done by CI without human
intervention.

By the way, why do you think we need to wait for that? We should not wait
for any CI; we should continue our own work.

My proposal isn't about making you wait to watch the result. There are two
use cases I want us to focus on here.

Case 1: When you make a PR to the Spark PR queue.

Travis CI will finish before SparkPullRequestBuilder.
The following will run in parallel:
 1. Current SparkPullRequestBuilder: JDK8 + sbt build + (no Java Linter)
 2. Travis: JDK7 + mvn build + Java Linter
 3. Travis: JDK8 + mvn build + Java Linter
As we know, 1 is the most time-consuming one, since it has lots of work
beyond Maven building and lint-java. You don't need to wait any longer
in many cases. Yes, in many cases, not all cases.


Case 2: When you prepare a PR on your branch.

If you are at the final commit (maybe already squashed), just go to
case 1.

However, we usually make lots of commits locally while preparing our PR,
and finally squash them into one and send a PR to Spark.
I mean you can use Travis CI while preparing your PRs.
Again, don't wait for Travis CI. Just push sometime, or at every
commit, and continue your work.

At the final stage, when you finish your coding, squash your commits
into one, amend your commit title or messages, and then check Travis CI.
Or, you can monitor the Travis CI result in the status menu bar.
If it shows a green icon, you have nothing to do.

   https://docs.travis-ci.com/user/apps/

To sum up, I think we don't need to wait for any CI. It's like email:
`Send, and get back to work.`

Dongjoon.


On Sun, May 22, 2016 at 8:32 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Without Zinc, 'mvn -DskipTests clean install' takes ~30 minutes.
>
> Maybe not everyone is willing to wait that long.
>
> On Sun, May 22, 2016 at 1:30 PM, Dongjoon Hyun <dongj...@apache.org>
> wrote:
>
>> Oh, Sure. My bad!
>>
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>> - For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.
>>
>> Thank you, Ted.
>>
>> Dongjoon.
>>
>> On Sun, May 22, 2016 at 1:29 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> The following line was repeated twice:
>>>
>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>
>>> Did you intend to cover JDK 8 ?
>>>
>>> Cheers
>>>
>>> On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun <dongj...@apache.org>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> I want to propose the followings.
>>>>
>>>> - Turn on Travis CI for Apache Spark PR queue.
>>>> - Recommend this for contributors, too
>>>>
>>>> Currently, Spark provides Travis CI configuration file to help
>>>> contributors check Scala/Java style conformance and JDK7/8 compilation
>>>> easily during their preparing pull requests. Please note that it's only
>>>> about static analysis.
>>>>
>>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>>>> Scalastyle is included in the step 'mvn install', too.
>>>>
>>>> Yep, if you turn on your Travis CI configuration, you can already see
>>>> the results on your branches before making PR. I wrote this email to
>>>> prevent more failures proactively and community-widely.
>>>>
>>>> For stability, I have been monitoring that for two weeks. It detects
>>>> the failures or recovery on JDK7 builds or Java linter on Spark master
>>>> branch correctly. The only exceptional case I observed rarely is `timeout`
>>>> failure due to hangs of maven. But, as we know, it's happen in our Jenkins
>>>> SparkPullRequestBuilder, too. I think we can ignore that.
>>>>
>>>> I'm sure that this will save much more community's efforts on the
>>>> static errors by preventing them at the very early stage. But, there might
>>>> be another reason not to do this. I'm wondering about your thoughts.
>>>>
>>>> I can make a Apache INFRA Jira issue for this if there is some
>>>> consensus.
>>>>
>>>> Warmly,
>>>> Dongjoon.
>>>>
>>>
>>>
>>
>


Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Dongjoon Hyun
Hi, All.

I want to propose the followings.

- Turn on Travis CI for Apache Spark PR queue.
- Recommend this for contributors, too

Currently, Spark provides Travis CI configuration file to help contributors
check Scala/Java style conformance and JDK7/8 compilation easily during
their preparing pull requests. Please note that it's only about static
analysis.

- For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
- For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
Scalastyle is included in the step 'mvn install', too.
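
As a rough illustration, a `.travis.yml` along these lines covers both
checks; this is a sketch of the idea, not necessarily the exact file that
ships in the Spark repository.

```yaml
# Illustrative sketch only; see the .travis.yml in the Spark repository for the real file.
language: java
sudo: required
dist: trusty            # Ubuntu 14.04, matching the Travis environment described above
jdk:
  - oraclejdk7
  - oraclejdk8
cache:
  directories:
    - $HOME/.m2
install:
  # Static checks only: compile and install without running tests.
  - ./build/mvn -DskipTests -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver install
script:
  - ./dev/lint-java
```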

Yep, if you turn on your Travis CI configuration, you can already see the
results on your branches before making PR. I wrote this email to prevent
more failures proactively and community-widely.

For stability, I have been monitoring that for two weeks. It detects the
failures or recovery on JDK7 builds or Java linter on Spark master branch
correctly. The only exceptional case I observed rarely is `timeout` failure
due to hangs of maven. But, as we know, it's happen in our Jenkins
SparkPullRequestBuilder, too. I think we can ignore that.

I'm sure that this will save much more community's efforts on the static
errors by preventing them at the very early stage. But, there might be
another reason not to do this. I'm wondering about your thoughts.

I can make a Apache INFRA Jira issue for this if there is some consensus.

Warmly,
Dongjoon.


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-22 Thread Dongjoon Hyun
Oh, Sure. My bad!

- For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
- For Oracle JDK8, mvn -DskipTests install and run `dev/lint-java`.

Thank you, Ted.

Dongjoon.

On Sun, May 22, 2016 at 1:29 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> The following line was repeated twice:
>
> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>
> Did you intend to cover JDK 8 ?
>
> Cheers
>
> On Sun, May 22, 2016 at 1:25 PM, Dongjoon Hyun <dongj...@apache.org>
> wrote:
>
>> Hi, All.
>>
>> I want to propose the followings.
>>
>> - Turn on Travis CI for Apache Spark PR queue.
>> - Recommend this for contributors, too
>>
>> Currently, Spark provides Travis CI configuration file to help
>> contributors check Scala/Java style conformance and JDK7/8 compilation
>> easily during their preparing pull requests. Please note that it's only
>> about static analysis.
>>
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>> - For Oracle JDK7, mvn -DskipTests install and run `dev/lint-java`.
>> Scalastyle is included in the step 'mvn install', too.
>>
>> Yep, if you turn on your Travis CI configuration, you can already see the
>> results on your branches before making PR. I wrote this email to prevent
>> more failures proactively and community-widely.
>>
>> For stability, I have been monitoring that for two weeks. It detects the
>> failures or recovery on JDK7 builds or Java linter on Spark master branch
>> correctly. The only exceptional case I observed rarely is `timeout` failure
>> due to hangs of maven. But, as we know, it's happen in our Jenkins
>> SparkPullRequestBuilder, too. I think we can ignore that.
>>
>> I'm sure that this will save much more community's efforts on the static
>> errors by preventing them at the very early stage. But, there might be
>> another reason not to do this. I'm wondering about your thoughts.
>>
>> I can make a Apache INFRA Jira issue for this if there is some consensus.
>>
>> Warmly,
>> Dongjoon.
>>
>
>


Re: Building spark master failed

2016-05-23 Thread Dongjoon Hyun
Hi,

That is not the latest.

The bug was fixed 5 days ago.

Regards,
Dongjoon.
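
(For context, `Process.isAlive()` only exists since JDK8, which is why a
JDK7 compile fails here. A JDK7-compatible liveness check generally looks
like the sketch below; it illustrates the usual workaround, not
necessarily the exact patch that was merged.)

```java
// JDK7-compatible check: Process has no isAlive() before Java 8, but
// exitValue() throws IllegalThreadStateException while the process runs.
static boolean isProcessAlive(Process process) {
  if (process == null) {
    return false;
  }
  try {
    process.exitValue();
    return false;  // exitValue() returned, so the process has terminated
  } catch (IllegalThreadStateException e) {
    return true;   // still running
  }
}
```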


On Mon, May 23, 2016 at 2:16 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Hi
>
> I have the following issue when trying to build the latest spark source
> code on master:
>
> /spark/common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java:147:
> error: cannot find symbol
> [error]   if (process != null && process.isAlive()) {
> [error] ^
> [error]   symbol:   method isAlive()
> [error]   location: variable process of type Process
> [error] 1 error
> [error] Compile failed at May 23, 2016 11:13:58 AM [1.319s]
>
> related to [INFO] Spark Project Networking ...
> FAILURE [  1.495 s]
>
> Am I missing some fix?
>
> Thanks
>
> Best,
> Ovidiu
>


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-24 Thread Dongjoon Hyun
Hi, All.

As Sean said, Vanzin made a PR for JDK7 compilation. We can ignore the
issue of JDK7 compilation.

The remaining issues are the Java linter and the Maven installation test.

To: Michael
For the rate limit, the Apache Foundation seems to have 30 concurrent
builds, according to the INFRA blog.

https://blogs.apache.org/infra/

However, this does not include the Travis CI queues of individual users.
We have over 800 GitHub-registered contributors.
They can run Travis CI on their own branches without this limitation, like me.

To: Shane
For Travis CI, I volunteer to turn it on/off and to take care of
complaints during that test period.

I hope these two points help you make a positive decision on a Travis CI
test drive. Any other potential issues we should consider?

Dongjoon.


Re: Using Travis for JDK7/8 compilation and lint-java.

2016-05-24 Thread Dongjoon Hyun
Thank you, Shane.

Sure, could you give me permission for the Spark Jira?

Although we haven't decided yet, I can add a Travis-related section
(summarizing the current configuration and the expected VM hardware, etc.).

That will be helpful for further discussions.

It's just a wiki; you can delete the Travis section anytime.

Dongjoon.


On Tue, May 24, 2016 at 12:15 PM, shane knapp  wrote:

> > As Sean said, Vanzin made a PR for JDK7 compilation. We can ignore the
> issue
> > of JDK7 compilation.
> >
> vanzin and i are working together on this right now...  we currently
> have java 7u79 installed on all of the workers.  if some random test
> failures keep happening during his tests, i will roll out 7u80 (which
> is known to be 'good') and set that as the default java on all of the
> workers (which is currently 7-79).
>
> > To: Shane
> > For Travis CI, I volunteer for turning on/off and taking care about
> > complains during that test period.
> >
> > I hope these two helps you make a positive decision on Travis CI test
> drive.
> > Any other potential issues we should consider?
> >
> if this is the route we decide to go, could you please update the qa
> infra wiki entry and add a relevant section for the travis setup?
>
> https://cwiki.apache.org/confluence/display/SPARK/Spark+QA+Infrastructure
>
> thanks,
>
> shane
>


Re: Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Dongjoon Hyun
+1 for fixing :)

Dongjoon.

On Wed, Jul 27, 2016 at 12:53 PM, Nick Pentreath 
wrote:

> +1 I don't believe there's any reason for the warnings to still be there
> except for available dev time & focus :)
>
> On Wed, 27 Jul 2016 at 21:35, Jacek Laskowski  wrote:
>
>> Kill 'em all -- one by one slowly yet gradually! :)
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Wed, Jul 27, 2016 at 9:11 PM, Holden Karau 
>> wrote:
>> > Now that the 2.0 release is out the door and I've got some cycles to do
>> some
>> > cleanups -  I'd like to know what other people think of the internal
>> > deprecation warnings we've introduced in a lot of a places in our code.
>> Once
>> > before I did some minor refactoring so the Python code which had to use
>> the
>> > deprecated code to expose the deprecated API wouldn't gum up the build
>> logs
>> > - but is there interest in doing that or are we more interested in not
>> > paying attention to the deprecation warnings for internal Spark
>> components
>> > (e.g. https://twitter.com/thepracticaldev/status/725769766603001856 )?
>> >
>> >
>> > --
>> > Cell : 425-233-8271
>> > Twitter: https://twitter.com/holdenkarau
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2016-07-23 Thread Dongjoon Hyun
Hi, All.

What about providing an official benchmark result comparing `Apache Spark on
JDK7` and `Apache Spark on JDK8`?
I think that is enough for this issue, since we cannot push users.
We had better let users choose between JDK7 and JDK8 for their own benefit.

Bests,
Dongjoon.



On Sat, Jul 23, 2016 at 4:35 PM, Koert Kuipers  wrote:

> They don't require dropping it sooner rather than later. But signalling in
> some way that java 8 is (strongly) recommend would be good.
>
> On Jul 23, 2016 6:59 PM, "Mark Hamstra"  wrote:
>
>> Sure, signalling well ahead of time is good, as is getting better
>> performance from Java 8; but do either of those interests really require
>> dropping Java 7 support sooner rather than later?
>>
>> Now, to retroactively copy edit myself, when I previously wrote "after
>> all or nearly all relevant clusters are actually no longer running on Java
>> 6", I meant "...no longer running on Java 7".  We should be at a point now
>> where there aren't many Java 6 clusters left, but my sense is that there
>> are still quite a number of Java 7 clusters around, and that there will be
>> for a good while still.
>>
>> On Sat, Jul 23, 2016 at 3:50 PM, Koert Kuipers  wrote:
>>
>>> i care about signalling it in advance mostly. and given the performance
>>> differences we do have some interest in pushing towards java 8
>>>
>>> On Jul 23, 2016 6:10 PM, "Mark Hamstra"  wrote:
>>>
>>> Why the push to remove Java 7 support as soon as possible (which is how
>>> I read your "cluster admins plan to migrate by date X, so Spark should end
>>> Java 7 support then, too")?  First, I don't think we should be removing
>>> Java 7 support until some time after all or nearly all relevant clusters
>>> are actually no longer running on Java 6, and that targeting removal of
>>> support at our best guess about when admins are just *planning* to migrate
>>> isn't a very good idea.  Second, I don't see the significant difficulty or
>>> harm in continuing to support Java 7 for a while longer.
>>>
>>> On Sat, Jul 23, 2016 at 2:54 PM, Koert Kuipers 
>>> wrote:
>>>
 dropping java 7 support was considered for spark 2.0.x but we decided
 against it.

 ideally dropping support for a java version should be communicated far
 in advance to facilitate the transition.

 is this the right time to make that decision and start communicating it
 (mailing list, jira, etc.)? perhaps for spark 2.1.x or spark 2.2.x?

 my general sense is that most cluster admins have plans to migrate to
 java 8 before end of year. so that could line up nicely with spark 2.2


>>>
>>>
>>


Re: Spark Homepage

2016-07-13 Thread Dongjoon Hyun
Oh, thank you, Holden.
Next time, I'll search user@ first before asking here. :)

Dongjoon

On Wed, Jul 13, 2016 at 12:07 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> This has also been reported on the user@ by a few people - other apache
> projects (arrow & hadoop) don't seem to be affected so maybe it was a just
> bad update for the Spark website?
>
> On Wed, Jul 13, 2016 at 12:05 PM, Dongjoon Hyun <dongj...@apache.org>
> wrote:
>
>> Hi, All.
>>
>> Currently, Spark Homepage (http://spark.apache.org/) shows file listing
>> (containing md files)
>> Is there any maintenance operation on that? :)
>>
>> Warmly,
>> Dongjoon.
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Spark Homepage

2016-07-13 Thread Dongjoon Hyun
Hi, All.

Currently, the Spark homepage (http://spark.apache.org/) shows a file listing
(containing md files).
Is there any maintenance operation going on? :)

Warmly,
Dongjoon.


Re: Spark Homepage

2016-07-13 Thread Dongjoon Hyun
Thank you for the pointer, Reynold!
I see the situation now.

On Wed, Jul 13, 2016 at 12:09 PM, Reynold Xin <r...@databricks.com> wrote:

> It's related to
> https://issues.apache.org/jira/servicedesk/agent/INFRA/issue/INFRA-12055
>
>
>
> On Wed, Jul 13, 2016 at 12:07 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> This has also been reported on the user@ by a few people - other apache
>> projects (arrow & hadoop) don't seem to be affected so maybe it was a just
>> bad update for the Spark website?
>>
>> On Wed, Jul 13, 2016 at 12:05 PM, Dongjoon Hyun <dongj...@apache.org>
>> wrote:
>>
>>> Hi, All.
>>>
>>> Currently, Spark Homepage (http://spark.apache.org/) shows file listing
>>> (containing md files)
>>> Is there any maintenance operation on that? :)
>>>
>>> Warmly,
>>> Dongjoon.
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: What's the meaning of Target Version/s in Spark's JIRA?

2016-06-28 Thread Dongjoon Hyun
Hi,

1.6.2 is just the result of back-porting that patch.

The patch was originally targeted and merged into 2.0.0.

Warmly,
Dongjoon.


On Tue, Jun 28, 2016 at 10:54 AM, Jacek Laskowski  wrote:

> Hi,
>
> While reviewing the release notes for 1.6.2 I stumbled upon
> https://issues.apache.org/jira/browse/SPARK-13522. It's got Target
> Version/s: 2.0.0 with Fix Version/s: 1.6.2, 2.0.0.
>
> What's the meaning of Target Version/s in Spark?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Dongjoon Hyun
Congratulation, Felix!

Bests,
Dongjoon.

On Monday, August 8, 2016, Ted Yu  wrote:

> Congratulations, Felix.
>
> On Mon, Aug 8, 2016 at 11:15 AM, Matei Zaharia  > wrote:
>
>> Hi all,
>>
>> The PMC recently voted to add Felix Cheung as a committer. Felix has been
>> a major contributor to SparkR and we're excited to have him join
>> officially. Congrats and welcome, Felix!
>>
>> Matei
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
>>
>>
>


Re: welcoming Burak and Holden as committers

2017-01-24 Thread Dongjoon Hyun
Great! Congratulations, Burak and Holden.

Bests,
Dongjoon.

On 2017-01-24 10:29 (-0800), Nicholas Chammas  
wrote: 
> 👏 👍
> 
> Congratulations, Burak and Holden.
> 
> On Tue, Jan 24, 2017 at 1:27 PM Russell Spitzer 
> wrote:
> 
> > Great news! Congratulations!
> >
> > On Tue, Jan 24, 2017 at 10:25 AM Dean Wampler 
> > wrote:
> >
> > Congratulations to both of you!
> >
> > dean
> >
> > *Dean Wampler, Ph.D.*
> > Author: Programming Scala, 2nd Edition
> > , Fast Data
> > Architectures for Streaming Applications
> > ,
> > Functional Programming for Java Developers
> > , and Programming Hive
> >  (O'Reilly)
> > Lightbend 
> > @deanwampler 
> > http://polyglotprogramming.com
> > https://github.com/deanwampler
> >
> > On Tue, Jan 24, 2017 at 6:14 PM, Xiao Li  wrote:
> >
> > Congratulations! Burak and Holden!
> >
> > 2017-01-24 10:13 GMT-08:00 Reynold Xin :
> >
> > Hi all,
> >
> > Burak and Holden have recently been elected as Apache Spark committers.
> >
> > Burak has been very active in a large number of areas in Spark, including
> > linear algebra, stats/maths functions in DataFrames, Python/R APIs for
> > DataFrames, dstream, and most recently Structured Streaming.
> >
> > Holden has been a long time Spark contributor and evangelist. She has
> > written a few books on Spark, as well as frequent contributions to the
> > Python API to improve its usability and performance.
> >
> > Please join me in welcoming the two!
> >
> >
> >
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Weird experience Hive with Spark Transformations

2017-01-17 Thread Dongjoon Hyun
Hi, Chetan.

Did you copy your `hive-site.xml` into the Spark conf directory? For example:

cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf

If you want to use the existing Hive metastore, you need to provide that 
information to Spark.

Bests,
Dongjoon.
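
(As a quick sanity check once `hive-site.xml` is in place, a minimal Java
sketch along the lines below, assuming Spark 2.x's SparkSession API, should
list the same tables no matter which directory you start from:)

```java
import org.apache.spark.sql.SparkSession;

public class HiveMetastoreCheck {
  public static void main(String[] args) {
    // Picks up hive-site.xml from $SPARK_HOME/conf (or the classpath),
    // so the session talks to the shared Hive metastore instead of
    // creating a local Derby metastore in the current directory.
    SparkSession spark = SparkSession.builder()
        .appName("HiveMetastoreCheck")
        .enableHiveSupport()
        .getOrCreate();

    // Should show the same tables regardless of where spark-shell or
    // spark-submit was started from.
    spark.sql("SHOW TABLES").show();
    spark.stop();
  }
}
```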

On 2017-01-16 21:36 (-0800), Chetan Khatri  wrote: 
> Hello,
> 
> I have following services are configured and installed successfully:
> 
> Hadoop 2.7.x
> Spark 2.0.x
> HBase 1.2.4
> Hive 1.2.1
> 
> *Installation Directories:*
> 
> /usr/local/hadoop
> /usr/local/spark
> /usr/local/hbase
> 
> *Hive Environment variables:*
> 
> #HIVE VARIABLES START
> export HIVE_HOME=/usr/local/hive
> export PATH=$PATH:$HIVE_HOME/bin
> #HIVE VARIABLES END
> 
> So, I can access Hive from anywhere as environment variables are
> configured. Now if if i start my spark-shell & hive from location
> /usr/local/hive then both work good for hive-metastore other wise from
> where i start spark-shell where spark creates own meta-store.
> 
> i.e I am reading from HBase and Writing to Hive using Spark. I dont know
> why this is weird issue is.
> 
> 
> 
> 
> Thanks.
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: GraphX-related "open" issues

2017-01-17 Thread Dongjoon Hyun
Hi, Takeshi.

> So, IMO it seems okay to close tickets about "Improvement" and "New Feature" 
> for now.

I'm just wondering what value you want to fill in for the `Resolution` field on 
those issues.

Maybe, 'Later'? Or, 'Won't Fix'?

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Support for Hive 2.x

2016-09-02 Thread Dongjoon Hyun
Hi, Rostyslav,

After your email, I also searched this morning, but I didn't find
a proper one.

The last related issue is SPARK-8064, `Upgrade Hive to 1.2`

https://issues.apache.org/jira/browse/SPARK-8064

If you want, you can file a JIRA issue describing your pain points, and then
you can monitor progress through it.

I guess you have more reasons to do that than just a compilation issue.

Bests,
Dongjoon.



On Fri, Sep 2, 2016 at 12:51 AM, Rostyslav Sotnychenko <
r.sotnyche...@gmail.com> wrote:

> Hello!
>
> I tried compiling Spark 2.0 with Hive 2.0, but as expected this failed.
>
> So I am wondering if there is any talks going on about adding support of
> Hive 2.x to Spark? I was unable to find any JIRA about this.
>
>
> Thanks,
> Rostyslav
>
>


Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Thank you for confirming, Sean and Marcelo!

Bests,
Dongjoon.

On Tue, Aug 30, 2016 at 10:05 AM, Marcelo Vanzin <van...@cloudera.com>
wrote:

> A quick look shows that maybe dev/sparktestsupport/modules.py needs to
> be modified, and a "build_profile_flags" added to the mesos section
> (similar to hive / hive-thriftserver).
>
> Note not all PR builds will trigger mesos currently, since it's listed
> as an independent module in the above file.
>
> On Tue, Aug 30, 2016 at 10:01 AM, Sean Owen <so...@cloudera.com> wrote:
> > I have the heady power to modify Jenkins jobs now, so I will carefully
> take
> > a look at them and see if any of the config needs -Pmesos. But yeah I
> > thought this should be baked into the script.
> >
> > On Tue, Aug 30, 2016 at 5:56 PM, Dongjoon Hyun <dongj...@apache.org>
> wrote:
> >>
> >> Hi, Michael.
> >>
> >> It's a great news!
> >>
> >> BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this
> new
> >> profile, -Pmesos.
> >>
> >> The PR was passed with the following Jenkins build arguments without
> >> `-Pmesos` option. (at the last test)
> >> ```
> >> [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:
> >> -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive-thriftserver -Phive
> test:package
> >> streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly
> >> streaming-kinesis-asl-assembly/assembly
> >> ```
> >>
> >> https://amplab.cs.berkeley.edu/jenkins/job/
> SparkPullRequestBuilder/64435/consoleFull
> >>
> >> Also, up to now, Jenkins seems not to use '-Pmesos' for all PRs.
> >>
> >> Bests,
> >> Dongjoon.
> >>
> >>
> >> On Fri, Aug 26, 2016 at 3:19 PM, Michael Gummelt <
> mgumm...@mesosphere.io>
> >> wrote:
> >>>
> >>> If it's separable, then sure.  Consistency is nice.
> >>>
> >>> On Fri, Aug 26, 2016 at 2:14 PM, Jacek Laskowski <ja...@japila.pl>
> wrote:
> >>>>
> >>>> Hi Michael,
> >>>>
> >>>> Congrats!
> >>>>
> >>>> BTW What I like about the change the most is that it uses the
> >>>> pluggable interface for TaskScheduler and SchedulerBackend (as
> >>>> introduced by YARN). Think Standalone should follow the steps. WDYT?
> >>>>
> >>>> Pozdrawiam,
> >>>> Jacek Laskowski
> >>>> 
> >>>> https://medium.com/@jaceklaskowski/
> >>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> >>>> Follow me at https://twitter.com/jaceklaskowski
> >>>>
> >>>>
> >>>> On Fri, Aug 26, 2016 at 10:20 PM, Michael Gummelt
> >>>> <mgumm...@mesosphere.io> wrote:
> >>>> > Hello devs,
> >>>> >
> >>>> > Much like YARN, Mesos has been refactored into a Maven module.  So
> >>>> > when
> >>>> > building, you must add "-Pmesos" to enable Mesos support.
> >>>> >
> >>>> > The pre-built distributions from Apache will continue to enable
> Mesos.
> >>>> >
> >>>> > PR: https://github.com/apache/spark/pull/14637
> >>>> >
> >>>> > Cheers
> >>>> >
> >>>> > --
> >>>> > Michael Gummelt
> >>>> > Software Engineer
> >>>> > Mesosphere
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Michael Gummelt
> >>> Software Engineer
> >>> Mesosphere
> >>
> >>
> >
>
>
>
> --
> Marcelo
>


Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Hi, Michael.

It's a great news!

BTW, I'm wondering if the Jenkins (SparkPullRequestBuilder) knows this new
profile, -Pmesos.

The PR passed with the following Jenkins build arguments, without the
`-Pmesos` option (at the last test):
```
[info] Building Spark (w/Hive 1.2.1) using SBT with these arguments:
 -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive-thriftserver -Phive test:package
streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly
streaming-kinesis-asl-assembly/assembly
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64435/consoleFull

Also, up to now, Jenkins does not seem to use '-Pmesos' for any PRs.

Bests,
Dongjoon.


On Fri, Aug 26, 2016 at 3:19 PM, Michael Gummelt 
wrote:

> If it's separable, then sure.  Consistency is nice.
>
> On Fri, Aug 26, 2016 at 2:14 PM, Jacek Laskowski  wrote:
>
>> Hi Michael,
>>
>> Congrats!
>>
>> BTW What I like about the change the most is that it uses the
>> pluggable interface for TaskScheduler and SchedulerBackend (as
>> introduced by YARN). Think Standalone should follow the steps. WDYT?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Aug 26, 2016 at 10:20 PM, Michael Gummelt
>>  wrote:
>> > Hello devs,
>> >
>> > Much like YARN, Mesos has been refactored into a Maven module.  So when
>> > building, you must add "-Pmesos" to enable Mesos support.
>> >
>> > The pre-built distributions from Apache will continue to enable Mesos.
>> >
>> > PR: https://github.com/apache/spark/pull/14637
>> >
>> > Cheers
>> >
>> > --
>> > Michael Gummelt
>> > Software Engineer
>> > Mesosphere
>>
>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: Mesos is now a maven module

2016-08-30 Thread Dongjoon Hyun
Thank you all for quick fix! :D

Dongjoon.

On Tuesday, August 30, 2016, Michael Gummelt  wrote:

> https://github.com/apache/spark/pull/14885
>
> Thanks
>
> On Tue, Aug 30, 2016 at 11:36 AM, Marcelo Vanzin  > wrote:
>
>> On Tue, Aug 30, 2016 at 11:32 AM, Sean Owen > > wrote:
>> > Ah, I helped miss that. We don't enable -Pyarn for YARN because it's
>> > already always set? I wonder if it makes sense to make that optional
>> > in order to speed up builds, or, maybe I'm missing a reason it's
>> > always essential.
>>
>> YARN is currently handled as part of the Hadoop profiles in
>> dev/run-tests.py; it could potentially be changed to behave like the
>> others (e.g. only enabled when the YARN code changes).
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-24 Thread Dongjoon Hyun
+1 (non binding)

I compiled and tested on the following two systems.

- CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1 with -Pyarn -Phadoop-2.7
-Pkinesis-asl -Phive -Phive-thriftserver -Dsparkr
- CentOS 7.2 / Open JDK 1.8.0_102 with -Pyarn -Phadoop-2.7 -Pkinesis-asl
-Phive -Phive-thriftserver

Bests,
Dongjoon.


On Fri, Sep 23, 2016 at 3:32 PM, Jacek Laskowski  wrote:

> Hi,
>
> Not that it could fix the issue but no -Pmesos?
>
> Jacek
>
> On 24 Sep 2016 12:08 a.m., "Sean Owen"  wrote:
>
>> +1 Signatures and hashes check out. I checked that the Kinesis
>> assembly artifacts are not present.
>>
>> I compiled and tested on Java 8 / Ubuntu 16 with -Pyarn -Phive
>> -Phive-thriftserver -Phadoop-2.7 -Psparkr and only saw one test
>> problem. This test never completed. If nobody else sees it, +1,
>> assuming it's a bad test or env issue.
>>
>> - should clone and clean line object in ClosureCleaner *** FAILED ***
>>   isContain was true Interpreter output contained 'Exception':
>>   Welcome to
>>   __
>>/ __/__  ___ _/ /__
>>   _\ \/ _ \/ _ `/ __/  '_/
>>  /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>> /_/
>>
>>   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_91)
>>   Type in expressions to have them evaluated.
>>   Type :help for more information.
>>
>>   scala> // Entering paste mode (ctrl-D to finish)
>>
>>
>>   // Exiting paste mode, now interpreting.
>>
>>   org.apache.spark.SparkException: Job 0 cancelled because
>> SparkContext was shut down
>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfte
>> rSchedulerStop$1.apply(DAGScheduler.scala:818)
>> ...
>>
>>
>> On Fri, Sep 23, 2016 at 7:01 AM, Reynold Xin  wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 2.0.1. The vote is open until Sunday, Sep 25, 2016 at 23:59 PDT and
>> passes
>> > if a majority of at least 3+1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.0.1
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > The tag to be voted on is v2.0.1-rc2
>> > (04141ad49806a48afccc236b699827997142bd57)
>> >
>> > This release candidate resolves 284 issues:
>> > https://s.apache.org/spark-2.0.1-jira
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc2-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1199
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc2-docs/
>> >
>> >
>> > Q: How can I help test this release?
>> > A: If you are a Spark user, you can help us test this release by taking
>> an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions from 2.0.0.
>> >
>> > Q: What justifies a -1 vote for this release?
>> > A: This is a maintenance release in the 2.0.x series.  Bugs already
>> present
>> > in 2.0.0, missing features, or bugs related to new features will not
>> > necessarily block this release.
>> >
>> > Q: What happened to 2.0.1 RC1?
>> > A: There was an issue with RC1 R documentation during release candidate
>> > preparation. As a result, rc1 was canceled before a vote was called.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Dongjoon Hyun
+1 (non binding)

RC3 is compiled and tested on the following two systems, too. All tests
passed.

* CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
   with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
-Dsparkr
* CentOS 7.2 / Open JDK 1.8.0_102
   with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver

Cheers,
Dongjoon



On Saturday, September 24, 2016, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.1
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5c
> 2cc2730d17)
>
> This release candidate resolves 290 issues: https://s.apache.org/spark-2.
> 0.1-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1201/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.0.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series.  Bugs already
> present in 2.0.0, missing features, or bugs related to new features will
> not necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.1.
>
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Dongjoon Hyun
+1 (non-binding)

At this time, I tested RC4 on the following.

- CentOS 6.8 (Final)
- OpenJDK 1.8.0_101
- Python 2.7.12

./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver 
-Dpyspark -Dsparkr -DskipTests clean package
./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver 
-Dpyspark -Dsparkr test
python2.7 python/run-tests.py --python-executables python2.7

All tests passed.

Cheers,
Dongjoon.

On 2016-09-29 12:20 (-0700), Sameer Agarwal  wrote: 
> +1
> 
> On Thu, Sep 29, 2016 at 12:04 PM, Sean Owen  wrote:
> 
> > +1 from me too, same result as my RC3 vote/testing.
> >
> > On Wed, Sep 28, 2016 at 10:14 PM, Reynold Xin  wrote:
> > > Please vote on releasing the following candidate as Apache Spark version
> > > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes
> > if a
> > > majority of at least 3+1 PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Spark 2.0.1
> > > [ ] -1 Do not release this package because ...
> > >
> > >
> > > The tag to be voted on is v2.0.1-rc4
> > > (933d2c1ea4e5f5c4ec8d375b5ccaa4577ba4be38)
> > >
> > > This release candidate resolves 301 issues:
> > > https://s.apache.org/spark-2.0.1-jira
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/pwendell.asc
> > >
> > > The staging repository for this release can be found at:
> > > https://repository.apache.org/content/repositories/orgapachespark-1203/
> > >
> > > The documentation corresponding to this release can be found at:
> > > http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/
> > >
> > >
> > > Q: How can I help test this release?
> > > A: If you are a Spark user, you can help us test this release by taking
> > an
> > > existing Spark workload and running on this release candidate, then
> > > reporting any regressions from 2.0.0.
> > >
> > > Q: What justifies a -1 vote for this release?
> > > A: This is a maintenance release in the 2.0.x series.  Bugs already
> > present
> > > in 2.0.0, missing features, or bugs related to new features will not
> > > necessarily block this release.
> > >
> > > Q: What fix version should I use for patches merging into branch-2.0 from
> > > now on?
> > > A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
> > > (i.e. RC5) is cut, I will change the fix version of those patches to
> > 2.0.1.
> > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 
> 
> -- 
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: welcoming Xiao Li as a committer

2016-10-04 Thread Dongjoon Hyun
Congratulations, Xiao!

Bests,
Dongjoon.

On Monday, October 3, 2016, Jagadeesan As  wrote:

> Congratulations Xiao Li.
>
> Cheers
> Jagadeesan A S
>
>
>
> From:Reynold Xin  >
> To:"dev@spark.apache.org
> " <
> dev@spark.apache.org
> >, Xiao Li <
> gatorsm...@gmail.com
> >
> Date:04-10-16 11:17 AM
> Subject:welcoming Xiao Li as a committer
> --
>
>
>
> Hi all,
>
> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
> committer. Xiao has been a super active contributor to Spark SQL. Congrats
> and welcome, Xiao!
>
> - Reynold
>
>
>


Re: Straw poll: dropping support for things like Scala 2.10

2016-10-26 Thread Dongjoon Hyun
Hi, All.

It's great since it's progress.

Then, at least, in 2017, Spark 2.2.0 will be out with JDK8 and Scala 2.11/2.12, 
right?

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.6.3 (RC1)

2016-10-27 Thread Dongjoon Hyun
Hi, All.

Last time, RC1 passed the tests with only the timezone test case failing. Now, 
that fix has been backported, too.
I'm wondering if there are any other issues blocking the release of Apache Spark 1.6.3.

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Dongjoon Hyun
+1 non-binding.

Built and tested on CentOS 6.6 / OpenJDK 1.8.0_111.

Cheers,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Dongjoon Hyun
+1 (non-binding)

It's built and tested on CentOS 6.8 / OpenJDK 1.8.0_111 with `-Pyarn 
-Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Psparkr` profile.

Cheers!
Dongjoon.

On 2016-11-08 14:03 (-0800), Michael Armbrust  wrote: 
> +1
> 
> On Tue, Nov 8, 2016 at 1:17 PM, Sean Owen  wrote:
> 
> > +1 binding
> >
> > (See comments on last vote; same results, except, the regression we
> > identified is fixed now.)
> >
> >
> > On Tue, Nov 8, 2016 at 6:10 AM Reynold Xin  wrote:
> >
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> >> a majority of at least 3+1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.0.2
> >> [ ] -1 Do not release this package because ...
> >>
> >>
> >> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367
> >> ba694b0c34)
> >>
> >> This release candidate resolves 84 issues: https://s.apache.org/spark-2.
> >> 0.2-jira
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1214/
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
> >>
> >>
> >> Q: How can I help test this release?
> >> A: If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions from 2.0.1.
> >>
> >> Q: What justifies a -1 vote for this release?
> >> A: This is a maintenance release in the 2.0.x series. Bugs already
> >> present in 2.0.1, missing features, or bugs related to new features will
> >> not necessarily block this release.
> >>
> >> Q: What fix version should I use for patches merging into branch-2.0 from
> >> now on?
> >> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> >> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
> >>
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Dongjoon Hyun
Hi, All.

Recently, I have observed frequent failures of the `randomized aggregation test` in 
ObjectHashAggregateSuite on SparkPullRequestBuilder.

SPARK-17982   https://github.com/apache/spark/pull/15546 (Today)
SPARK-18123   https://github.com/apache/spark/pull/15664 (Today)
SPARK-18169   https://github.com/apache/spark/pull/15682 (Today)
SPARK-18292   https://github.com/apache/spark/pull/15789 (4 days ago. It's gone 
after `retest`)

I'm wondering if anyone else has met these failures. Should I file a JIRA issue for 
this?

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Is `randomized aggregation test` testsuite stable?

2016-11-10 Thread Dongjoon Hyun
Great! Thank you so much, Cheng!

Bests,
Dongjoon.

On 2016-11-10 11:21 (-0800), Cheng Lian <lian.cs@gmail.com> wrote: 
> Hey Dongjoon,
> 
> Thanks for reporting. I'm looking into these OOM errors. Already 
> reproduced them locally but haven't figured out the root cause yet. 
> Gonna disable them temporarily for now.
> 
> Sorry for the inconvenience!
> 
> Cheng
> 
> 
> On 11/10/16 8:48 AM, Dongjoon Hyun wrote:
> > Hi, All.
> >
> > Recently, I observed frequent failures of `randomized aggregation test` of 
> > ObjectHashAggregateSuite in SparkPullRequestBuilder.
> >
> > SPARK-17982   https://github.com/apache/spark/pull/15546 (Today)
> > SPARK-18123   https://github.com/apache/spark/pull/15664 (Today)
> > SPARK-18169   https://github.com/apache/spark/pull/15682 (Today)
> > SPARK-18292   https://github.com/apache/spark/pull/15789 (4 days ago. It's 
> > gone after `retest`)
> >
> > I'm wondering if anyone meet those failures? Should I file a JIRA issue for 
> > this?
> >
> > Bests,
> > Dongjoon.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Two major versions?

2016-11-27 Thread Dongjoon Hyun
Hi, All.

Do we have a release plan of Apache Spark 1.6.4?

To my knowledge, the Apache Spark community has been focusing on the latest two 
versions.
There has been no official release of Apache Spark *X.X.4* so far. This is also 
documented on the Apache Spark home page (versioning policy; 
http://spark.apache.org/versioning-policy.html)

> A minor release usually sees 1-2 maintenance releases in the 6 months 
> following its first release. 

So, personally, I don't expect an Apache Spark 1.6.4. Apache Spark 2.1 will be 
released very soon, and then 2.1 and 2.0 will be the two versions on which most 
of the community effort is going to focus.

However, *literally*, the two major versions of Apache Spark will be Apache Spark 
1.X (1.6.3) and Apache Spark 2.X. Since there are API compatibility issues 
between major versions, I guess 1.6.X will survive for a while, like JDK7.

If possible, could we have a clear statement on the homepage about whether there 
is a plan for 1.6.4?

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Two major versions?

2016-11-27 Thread Dongjoon Hyun
Thank you, Reynold. That is clear enough as a criterion both for backporting and for cutting a new release.
Yes. Definitely. If there is no new commit on `branch-1.6`, there is no need to 
release.

However, Apache Spark 1.6.3 was a single release *accumulating* 52 patches of that 
kind.
Given that `branch-1.5` has over 100 patches and we don't expect a `1.5.3`, 
`branch-1.6` looks extraordinary to me, as the last emperor of Apache Spark 1.X.

Bests,
Dongjoon.

On 2016-11-27 12:50 (-0800), Reynold Xin <r...@databricks.com> wrote: 
> I think this highly depends on what issues are found, e.g. critical bugs
> that impact wide use cases, or security bugs.
> 
> 
> On Sun, Nov 27, 2016 at 12:49 PM, Dongjoon Hyun <dongj...@apache.org> wrote:
> 
> > Hi, All.
> >
> > Do we have a release plan of Apache Spark 1.6.4?
> >
> > Up to my knowledge, Apache Spark community has been focusing on latest two
> > versions.
> > There was no official release of Apache Spark *X.X.4* so far. It's also
> > well-documented on Apache Spark home page (Versioning policy;
> > http://spark.apache.org/versioning-policy.html)
> >
> > > A minor release usually sees 1-2 maintenance releases in the 6 months
> > following its first release.
> >
> > So, personally, I don't expect Apache Spark 1.6.4. After Apache Spark 2.1
> > will be released very soon, 2.1 and 2.0 will be the two major versions
> > which the most community effort is going to focus on.
> >
> > However, *literally*, two major versions of Apache Spark will be Apache
> > Spark 1.X (1.6.3) and Apache Spark 2.X. Since there is API compatibility
> > issues between major versions, I guess 1.6.X will survive for a while like
> > JDK7.
> >
> > If possible, could we have a clear statement whether there is a plan for
> > 1.6.4 on homepage?
> >
> > Bests,
> > Dongjoon.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Two major versions?

2016-11-28 Thread Dongjoon Hyun
Thank you, Sean.

Now, I agree with you that major releases are too rare to put even informal
statements around.

Bests,
Dongjoon

On Mon, Nov 28, 2016 at 04:17 Sean Owen <so...@cloudera.com> wrote:

> Yeah, there's no official position on this. BTW see the new home of what
> info is published on this topic:
> http://spark.apache.org/versioning-policy.html
>
> The answer is indeed that minor releases have a target cadence, but
> maintenance releases are as-needed, as defined by the release manager's
> judgment.
>
> From
> https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:versions-panel
>  you
> can see that maintenance releases for a minor release seem to continue for
> 3-6 months in general, with 1.6.x going on for a longer period as the last
> 1.x minor release.
>
> I wouldn't mind putting down some text to set a non-binding expectation
> around this, that minor releases might be 'supported' for 3-6 months? until
> 2 more minor releases have succeeded it? Because in practice that's very
> much how back-ports behave.
>
> We also don't say anything about major releases but I think that may also
> be too rare to put even informal statements around. Every couple years?
>
> On Sun, Nov 27, 2016 at 8:49 PM Dongjoon Hyun <dongj...@apache.org> wrote:
>
> Hi, All.
>
> Do we have a release plan of Apache Spark 1.6.4?
>
> Up to my knowledge, Apache Spark community has been focusing on latest two
> versions.
> There was no official release of Apache Spark *X.X.4* so far. It's also
> well-documented on Apache Spark home page (Versioning policy;
> http://spark.apache.org/versioning-policy.html)
>
> > A minor release usually sees 1-2 maintenance releases in the 6 months
> following its first release.
>
> So, personally, I don't expect Apache Spark 1.6.4. After Apache Spark 2.1
> will be released very soon, 2.1 and 2.0 will be the two major versions
> which the most community effort is going to focus on.
>
> However, *literally*, two major versions of Apache Spark will be Apache
> Spark 1.X (1.6.3) and Apache Spark 2.X. Since there is API compatibility
> issues between major versions, I guess 1.6.X will survive for a while like
> JDK7.
>
> If possible, could we have a clear statement whether there is a plan for
> 1.6.4 on homepage?
>
> Bests,
> Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Running lint-java during PR builds?

2016-11-16 Thread Dongjoon Hyun
Hi, Marcelo and Ryan.

That was the main purpose of my proposal about Travis CI.
IMO, that is the only way to achieve it without any harmful side effects on the 
Jenkins infra.

Spark is already ready for that. Like AppVeyor, if one of you files an INFRA JIRA 
issue to enable it, they will turn it on. Then we can try it and see the result. 
Also, it can easily be turned off again if we don't want it.

Without this, we will keep consuming community effort. For example, we merged a 
lint-java error fix PR seven hours ago, but the master branch still has one 
lint-java error.

https://travis-ci.org/dongjoon-hyun/spark/jobs/176351319
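For anyone who wants to reproduce the check locally, here is a rough sketch
(assuming the existing dev/lint-java script; as Ryan notes below, the artifacts
need to be installed first, which is the expensive part):

```
# Hedged sketch: install the artifacts once (slow), then run the Java style check.
./build/mvn -q -DskipTests install
./dev/lint-java
```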

Actually, I've been monitoring the history here. (It's synced every 30 minutes.)

https://travis-ci.org/dongjoon-hyun/spark/builds

Could we give this a chance?

Bests,
Dongjoon.

On 2016-11-15 13:40 (-0800), "Shixiong(Ryan) Zhu" <shixi...@databricks.com> 
wrote: 
> I remember it's because you need to run `mvn install` before running
> lint-java if the maven cache is empty, and `mvn install` is pretty heavy.
> 
> On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> 
> > Hey all,
> >
> > Is there a reason why lint-java is not run during PR builds? I see it
> > seems to be maven-only, is it really expensive to run after an sbt
> > build?
> >
> > I see a lot of PRs coming in to fix Java style issues, and those all
> > seem a little unnecessary. Either we're enforcing style checks or
> > we're not, and right now it seems we aren't.
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-11 Thread Dongjoon Hyun
Hi.

Now, do we have Apache Spark 2.0.2? :)

Bests,
Dongjoon.

On 2016-11-07 22:09 (-0800), Reynold Xin  wrote: 
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
> 
> 
> The tag to be voted on is v2.0.2-rc3
> (584354eaac02531c9584188b143367ba694b0c34)
> 
> This release candidate resolves 84 issues:
> https://s.apache.org/spark-2.0.2-jira
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
> 
> 
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
> 
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
> 
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Dongjoon Hyun
Hi, Sean.

The same failure blocks me, too.

- SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** FAILED ***

I used `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Dsparkr` 
on CentOS 7 / OpenJDK1.8.0_111.

Dongjoon.

On 2016-11-02 10:44 (-0700), Sean Owen  wrote: 
> Sigs, license, etc are OK. There are no Blockers for 2.0.2, though here are
> the 4 issues still open:
> 
> SPARK-14387 Enable Hive-1.x ORC compatibility with
> spark.sql.hive.convertMetastoreOrc
> SPARK-17957 Calling outer join and na.fill(0) and then inner join will miss
> rows
> SPARK-17981 Incorrectly Set Nullability to False in FilterExec
> SPARK-18160 spark.files & spark.jars should not be passed to driver in yarn
> mode
> 
> Running with Java 8, -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 on
> Ubuntu 16, I am seeing consistent failures in this test below. I think we
> very recently changed this so it could be legitimate. But does anyone else
> see something like this? I have seen other failures in this test due to OOM
> but my MAVEN_OPTS allows 6g of heap, which ought to be plenty.
> 
> 
> - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** FAILED
> ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
> /_/
> 
>   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
> 
>   scala>
>   scala> keyValueGrouped:
> org.apache.spark.sql.KeyValueGroupedDataset[Int,(Int, Int)] =
> org.apache.spark.sql.KeyValueGroupedDataset@70c30f72
> 
>   scala> mapGroups: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int,
> _2: int]
> 
>   scala> broadcasted: org.apache.spark.broadcast.Broadcast[Int] =
> Broadcast(0)
> 
>   scala>
>   scala>
>   scala> dataset: org.apache.spark.sql.Dataset[Int] = [value: int]
> 
>   scala> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in
> stage 0.0 (TID 0, localhost):
> com.google.common.util.concurrent.ExecutionError:
> java.lang.ClassCircularityError:
> io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:841)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:188)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:825)
>   at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:822)
>   at
> org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:137)
>   at
> org.apache.spark.sql.execution.AppendColumnsExec$$anonfun$9.apply(objects.scala:251)
>   at
> org.apache.spark.sql.execution.AppendColumnsExec$$anonfun$9.apply(objects.scala:250)
>   at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>   Caused by: java.lang.ClassCircularityError:
> io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at
> 

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Dongjoon Hyun
+1 (non-binding)

It's built and tested on CentOS 6.8 / OpenJDK 1.8.0_111, too.

Cheers,
Dongjoon.

On 2016-11-03 14:30 (-0700), Davies Liu  wrote: 
> +1
> 
> On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
> > majority of at least 3+1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.6.3
> > [ ] -1 Do not release this package because ...
> >
> >
> > The tag to be voted on is v1.6.3-rc2
> > (1e860747458d74a4ccbd081103a0542a2367b14b)
> >
> > This release candidate addresses 52 JIRA tickets:
> > https://s.apache.org/spark-1.6.3-jira
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1212/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/
> >
> >
> > ===
> > == How can I help test this release?
> > ===
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions from 1.6.2.
> >
> > 
> > == What justifies a -1 vote for this release?
> > 
> > This is a maintenance release in the 1.6.x series.  Bugs already present in
> > 1.6.2, missing features, or bugs related to new features will not
> > necessarily block this release.
> >
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Question about SPARK-11374 (skip.header.line.count)

2016-12-10 Thread Dongjoon Hyun
Thank you for the opinion, Felix.

Bests,
Dongjoon.

On Sat, Dec 10, 2016 at 11:00 AM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> +1 I think it's useful to always have a pure SQL way and skip header for
> plain text / csv that lots of companies have.
>
>
> --
> *From:* Dongjoon Hyun <dongj...@apache.org>
> *Sent:* Friday, December 9, 2016 9:42:58 AM
> *To:* Dongjin Lee; dev@spark.apache.org
> *Subject:* Re: Question about SPARK-11374 (skip.header.line.count)
>
> Thank you for the opinion, Dongjin!
>
>
> On Thu, Dec 8, 2016 at 21:56 Dongjin Lee <dong...@apache.org> wrote:
>
>> +1 For this idea. I need it also.
>>
>> Regards,
>> Dongjin
>>
>> On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun <dongj...@apache.org>
>> wrote:
>>
>> Hi, All.
>>
>>
>>
>>
>>
>> Could you give me some opinion?
>>
>>
>>
>>
>>
>> There is an old SPARK issue, SPARK-11374, about removing header lines
>> from text file.
>>
>>
>> Currently, Spark supports removing CSV header lines by the following way.
>>
>>
>>
>>
>>
>> ```
>>
>>
>> scala> spark.read.option("header","true").csv("/data").show
>>
>>
>> +---+---+
>>
>>
>> | c1| c2|
>>
>>
>> +---+---+
>>
>>
>> |  1|  a|
>>
>>
>> |  2|  b|
>>
>>
>> +---+---+
>>
>>
>> ```
>>
>>
>>
>>
>>
>> In SQL world, we can support that like the Hive way,
>> `skip.header.line.count`.
>>
>>
>>
>>
>>
>> ```
>>
>>
>> scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT
>> DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data'
>> TBLPROPERTIES('skip.header.line.count'='1')")
>>
>>
>> scala> sql("SELECT * FROM t1").show
>>
>>
>> +---+-+
>>
>>
>> | id|value|
>>
>>
>> +---+-+
>>
>>
>> |  1|a|
>>
>>
>> |  2|b|
>>
>>
>> +---+-+
>>
>>
>> ```
>>
>>
>>
>>
>>
>> Although I made a PR for this based on the JIRA issue, I want to know
>> this is really needed feature.
>>
>>
>> Is it need for your use cases? Or, it's enough for you to remove them in
>> a preprocessing stage.
>>
>>
>> If this is too old and not proper in these days, I'll close the PR and
>> JIRA issue as WON'T FIX.
>>
>>
>>
>>
>>
>> Thank you for all in advance!
>>
>>
>>
>>
>>
>> Bests,
>>
>>
>> Dongjoon.
>>
>>
>>
>>
>>
>> -
>>
>>
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> * Dongjin Lee *
>>
>>
>> * Software developer in Line+. So interested in massive-scale machine
>> learning. facebook: www.facebook.com/dongjin.lee.kr
>> <http://www.facebook.com/dongjin.lee.kr>
>> linkedin: kr.linkedin.com/in/dongjinleekr
>> <http://kr.linkedin.com/in/dongjinleekr> github:
>> <http://goog_969573159/>github.com/dongjinleekr
>> <http://github.com/dongjinleekr> twitter: www.twitter.com/dongjinleekr
>> <http://www.twitter.com/dongjinleekr> *
>>
>>
>>


Re: Question about SPARK-11374 (skip.header.line.count)

2016-12-11 Thread Dongjoon Hyun
Thank you for the opinion, Mingjie and Liang-Chi.

Dongjoon.

On Sun, Dec 11, 2016 at 5:42 PM, Liang-Chi Hsieh  wrote:

> Hi Dongjoon,
>
> I know some people only use Spark SQL with SQL syntax not Dataset API. So I
> think it should be useful to provide a way to do this in SQL.
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Question-about-
> SPARK-11374-skip-header-line-count-tp20180p20203.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Forking or upgrading Apache Parquet in Spark

2016-12-15 Thread Dongjoon Hyun
Hi, All.

I made a PR to upgrade Parquet to 1.9.0 for Apache Spark 2.2, which is expected around late March.

- https://github.com/apache/spark/pull/16281

Currently, there are several options under discussion. Here is the summary.

1. Fork Parquet 1.8.X and maintain it, like the Spark Hive fork.
2. Wait and see how Parquet 1.9.x is adopted in other communities.
3. Build stronger integration tests covering both features and performance.

I think we had better inform all of you on the dev mailing list because one of the 
options involves forking.
If you have any opinion, please reply here or on the PR.

BTW, the default decision is option 2, since we will keep using Apache Parquet 
1.8.1 for a while.

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Dongjoon Hyun
RC5 is also tested on CentOS 6.8, OpenJDK 1.8.0_111, R 3.3.2 with profiles 
`-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Psparkr`.

BTW, there still exist five on-going issues in JIRA (with target version 2.1.0).

1. SPARK-16845  
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
2. SPARK-18669 Update Apache docs regard watermarking in Structured Streaming
3. SPARK-18894 Event time watermark delay threshold specified in months or 
years gives incorrect results
4. SPARK-18899 append data to a bucketed table with mismatched bucketing should 
fail

+1 with known issues for now.

Bests,
Dongjoon.

On 2016-12-16 09:57 (-0800), Sean Owen  wrote: 
> (If you have a template for these emails, maybe update it to use https
> links. They work for apache.org domains. After all we are asking people to
> verify the integrity of release artifacts, so it might as well be secure.)
> 
> (Also the new archives use .tar.gz instead of .tgz like the others. No big
> deal, my OCD eye just noticed it.)
> 
> I don't see an Apache license / notice for the Pyspark or SparkR artifacts.
> It would be good practice to include this in a convenience binary. I'm not
> sure if it's strictly mandatory, but something to adjust in any event. I
> think that's all there is to do for SparkR. For Pyspark, which packages a
> bunch of dependencies, it does include the licenses (good) but I think it
> should include the NOTICE file.
> 
> This is the first time I recall getting 0 test failures off the bat!
> I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.
> 
> I think I'd +1 this therefore unless someone knows that the license issue
> above is real and a blocker.
> 
> On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin  wrote:
> 
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.1.0
> > [ ] -1 Do not release this package because ...
> >
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.1.0-rc5
> > (cd0a08361e2526519e7c131c42116bf56fa62c76)
> >
> > List of JIRA tickets resolved are:
> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1223/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
> >
> >
> > *FAQ*
> >
> > *How can I help test this release?*
> >
> > If you are a Spark user, you can help us test this release by taking an
> > existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > *What should happen to JIRA tickets still targeting 2.1.0?*
> >
> > Committers should look at those and triage. Extremely important bug fixes,
> > documentation, and API tweaks that impact compatibility should be worked on
> > immediately. Everything else please retarget to 2.1.1 or 2.2.0.
> >
> > *What happened to RC3/RC5?*
> >
> > They had issues withe release packaging and as a result were skipped.
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Fwd: Question about SPARK-11374 (skip.header.line.count)

2016-12-08 Thread Dongjoon Hyun
+dev

I forgot to add @user.

Dongjoon.

-- Forwarded message -
From: Dongjoon Hyun <dongj...@apache.org>
Date: Thu, Dec 8, 2016 at 16:00
Subject: Question about SPARK-11374 (skip.header.line.count)
To: <dev@spark.apache.org>


Hi, All.



Could you give me some opinion?



There is an old SPARK issue, SPARK-11374, about removing header lines from
text file.

Currently, Spark supports removing CSV header lines by the following way.



```

scala> spark.read.option("header","true").csv("/data").show

+---+---+

| c1| c2|

+---+---+

|  1|  a|

|  2|  b|

+---+---+

```



In SQL world, we can support that like the Hive way,
`skip.header.line.count`.



```

scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data'
TBLPROPERTIES('skip.header.line.count'='1')")

scala> sql("SELECT * FROM t1").show

+---+-+

| id|value|

+---+-+

|  1|a|

|  2|b|

+---+-+

```



Although I made a PR for this based on the JIRA issue, I want to know this
is really needed feature.

Is it need for your use cases? Or, it's enough for you to remove them in a
preprocessing stage.

If this is too old and not proper in these days, I'll close the PR and JIRA
issue as WON'T FIX.



Thank you for all in advance!



Bests,

Dongjoon.



-

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Question about SPARK-11374 (skip.header.line.count)

2016-12-08 Thread Dongjoon Hyun
Hi, All.

Could you give me some opinion?

There is an old SPARK issue, SPARK-11374, about removing header lines from text 
files.
Currently, Spark supports removing CSV header lines in the following way.

```
scala> spark.read.option("header","true").csv("/data").show
+---+---+
| c1| c2|
+---+---+
|  1|  a|
|  2|  b|
+---+---+
```

In SQL world, we can support that like the Hive way, `skip.header.line.count`.

```
scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data' 
TBLPROPERTIES('skip.header.line.count'='1')")
scala> sql("SELECT * FROM t1").show
+---+-+
| id|value|
+---+-+
|  1|a|
|  2|b|
+---+-+
```

Although I made a PR for this based on the JIRA issue, I want to know whether this 
is a really needed feature.
Is it needed for your use cases, or is it enough for you to remove the headers in a 
preprocessing stage (see the sketch below)?
If this is too old and no longer appropriate these days, I'll close the PR and JIRA 
issue as WON'T FIX.
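For context, the preprocessing I mean could be as simple as stripping the first
line of each input file before Spark or Hive ever reads it. A rough, illustrative
sketch (the *.csv paths are hypothetical):

```
# Illustrative only: drop the header line of each input file in place.
for f in /data/*.csv; do
  tail -n +2 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```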

Thank you all in advance!

Bests,
Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Question about SPARK-11374 (skip.header.line.count)

2016-12-09 Thread Dongjoon Hyun
Thank you for the opinion, Dongjin!


On Thu, Dec 8, 2016 at 21:56 Dongjin Lee <dong...@apache.org> wrote:

> +1 For this idea. I need it also.
>
> Regards,
> Dongjin
>
> On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun <dongj...@apache.org> wrote:
>
> Hi, All.
>
>
>
>
>
> Could you give me some opinion?
>
>
>
>
>
> There is an old SPARK issue, SPARK-11374, about removing header lines from
> text file.
>
>
> Currently, Spark supports removing CSV header lines by the following way.
>
>
>
>
>
> ```
>
>
> scala> spark.read.option("header","true").csv("/data").show
>
>
> +---+---+
>
>
> | c1| c2|
>
>
> +---+---+
>
>
> |  1|  a|
>
>
> |  2|  b|
>
>
> +---+---+
>
>
> ```
>
>
>
>
>
> In SQL world, we can support that like the Hive way,
> `skip.header.line.count`.
>
>
>
>
>
> ```
>
>
> scala> sql("CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data'
> TBLPROPERTIES('skip.header.line.count'='1')")
>
>
> scala> sql("SELECT * FROM t1").show
>
>
> +---+-+
>
>
> | id|value|
>
>
> +---+-+
>
>
> |  1|a|
>
>
> |  2|b|
>
>
> +---+-+
>
>
> ```
>
>
>
>
>
> Although I made a PR for this based on the JIRA issue, I want to know this
> is really needed feature.
>
>
> Is it need for your use cases? Or, it's enough for you to remove them in a
> preprocessing stage.
>
>
> If this is too old and not proper in these days, I'll close the PR and
> JIRA issue as WON'T FIX.
>
>
>
>
>
> Thank you for all in advance!
>
>
>
>
>
> Bests,
>
>
> Dongjoon.
>
>
>
>
>
> -
>
>
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
>
>
>
>
> --
> *Dongjin Lee*
>
>
> *Software developer in Line+.So interested in massive-scale machine
> learning.facebook: www.facebook.com/dongjin.lee.kr
> <http://www.facebook.com/dongjin.lee.kr>linkedin: 
> kr.linkedin.com/in/dongjinleekr
> <http://kr.linkedin.com/in/dongjinleekr>github:
> <http://goog_969573159/>github.com/dongjinleekr
> <http://github.com/dongjinleekr>twitter: www.twitter.com/dongjinleekr
> <http://www.twitter.com/dongjinleekr>*
>
>
>


Re: Parquet patch release

2017-01-06 Thread Dongjoon Hyun
Great! Thank you, Ryan.

Bests,
Dongjoon.

On Fri, Jan 6, 2017 at 15:49 Xiao Li  wrote:

> Hi, Ryan,
>
> Really thank you for your help!
>
> Happy New Year!
>
> Xiao Li
>
> 2017-01-06 15:46 GMT-08:00 Ryan Blue :
>
> Last month, there was interest in a Parquet patch release on PR #16281
> . I went ahead and reviewed
> commits that should go into a Parquet patch release and started a 1.8.2
> discussion
> 
> on the Parquet dev list. If you're interested in reviewing what goes into
> 1.8.2 or have suggestions, please follow that thread on the Parquet list.
>
> Thanks!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
>
>
>
>


Re: Spark Project build Issues.(Intellij)

2017-06-28 Thread Dongjoon Hyun
Did you follow the guide in the `IDE Setup` -> `IntelliJ` section of
http://spark.apache.org/developer-tools.html?

Bests,
Dongjoon.

On Wed, Jun 28, 2017 at 5:13 PM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:

> Hi All,
>
> When i try to build source code of apache spark code from
> https://github.com/apache/spark.git, i am getting below errors,
>
> Error:(9, 14) EventBatch is already defined as object EventBatch
> public class EventBatch extends org.apache.avro.specific.SpecificRecordBase
> implements org.apache.avro.specific.SpecificRecord {
> Error:(9, 14) EventBatch is already defined as class EventBatch
> public class EventBatch extends org.apache.avro.specific.SpecificRecordBase
> implements org.apache.avro.specific.SpecificRecord {
> /Users/svegesna/svegesna/dev/scala/spark/external/flume-
> sink/target/scala-2.11/src_managed/main/compiled_avro/
> org/apache/spark/streaming/flume/sink/SparkFlumeProtocol.java
> Error:(26, 18) SparkFlumeProtocol is already defined as object
> SparkFlumeProtocol
> public interface SparkFlumeProtocol {
> Error:(26, 18) SparkFlumeProtocol is already defined as trait
> SparkFlumeProtocol
> public interface SparkFlumeProtocol {
> /Users/svegesna/svegesna/dev/scala/spark/external/flume-
> sink/target/scala-2.11/src_managed/main/compiled_avro/
> org/apache/spark/streaming/flume/sink/SparkSinkEvent.java
> Error:(9, 14) SparkSinkEvent is already defined as object SparkSinkEvent
> public class SparkSinkEvent extends 
> org.apache.avro.specific.SpecificRecordBase
> implements org.apache.avro.specific.SpecificRecord {
> Error:(9, 14) SparkSinkEvent is already defined as class SparkSinkEvent
> public class SparkSinkEvent extends 
> org.apache.avro.specific.SpecificRecordBase
> implements org.apache.avro.specific.SpecificRecord {
>
> Would like to know , if i can successfully build the project, so that i
> can test and debug some of spark's functionalities.
>
> Regards,
> Satyajit.
>


Re: Thoughts on release cadence?

2017-07-30 Thread Dongjoon Hyun
+1

Bests,
Dongjoon

On Sun, Jul 30, 2017 at 02:20 Sean Owen  wrote:

> The project had traditionally posted some guidance about upcoming
> releases. The last release cycle was about 6 months. What about penciling
> in December 2017 for 2.3.0? http://spark.apache.org/versioning-policy.html
>


Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Dongjoon Hyun
Great!

Congratulations, Hyukjin and Sameer!

Dongjoon.

On Mon, Aug 7, 2017 at 8:55 AM, Bai, Dave  wrote:

> Congrats, leveled up!=)
>
> On 8/7/17, 10:53 AM, "Matei Zaharia"  wrote:
>
> >Hi everyone,
> >
> >The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as
> >committers. Join me in congratulating both of them and thanking them for
> >their contributions to the project!
> >
> >Matei
> >-
> >To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: CHAR implementation?

2017-09-15 Thread Dongjoon Hyun
Thank you, Ryan!

Yes. Right. If we turn off `spark.sql.hive.convertMetastoreParquet`, Spark
pads the space.

For ORC CHAR, it's the same: ORC only handles truncation on write.
The padding is handled on the Hive side in `HiveCharWritable` via
`HiveBaseChar.java` on read.
Spark's ORCFileFormat uses HiveCharWritable, so the space is padded whether
`spark.sql.hive.convertMetastoreOrc` is false or true. I was able to test it
in the following PR; previously, it was blocked for another reason.

https://github.com/apache/spark/pull/19235

It seems that we may choose between:
- adding the padding logic inside the Spark Parquet reader (roughly the behavior
  sketched below), or
- ignoring it for performance/backward compatibility.
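As a purely illustrative sketch, the first option would make the Parquet path
return the same result as the following query-level workaround (using rpad on
the t3 table from earlier in this thread; this is not the actual reader change,
just the equivalent result):

```
# Illustrative: pad the CHAR(3) values on read so length(a) becomes 3, as with t1/t2.
./bin/spark-sql -e "SELECT rpad(a, 3, ' ') AS a, length(rpad(a, 3, ' ')) FROM t3"
```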

The Data Source V2 read path was merged into master today.
We will change the code base anyway in Spark 2.3.

Bests,
Dongjoon


On Fri, Sep 15, 2017 at 10:05 AM, Ryan Blue <rb...@netflix.com> wrote:

> My guess is that this is because Parquet doesn't have a CHAR type. That
> should be applied to strings by Spark for Parquet.
>
> The reason from Parquet's perspective not to support CHAR is that we have
> no expectation that it is a portable type. Non-SQL writers aren't going to
> pad values with spaces, and it is a terrible idea for Parquet to silently
> alter or truncate data to fit the CHAR type. There's also no performance
> gain from CHAR because multi-byte UTF8 characters prevent us from using a
> fixed-length binary field. The conclusion we came to is that CHAR is a
> SQL-only type and has to be enforced by SQL engines: Spark should pad or
> truncate values, and expect Parquet to faithfully hand back exactly what
> was stored.
>
> If Spark doesn't have logic for this, then it is probably relying on the
> Hive serde to pad the first case. I'm not sure what ORC does, maybe it has
> a native CHAR type.
>
> rb
>
> On Thu, Sep 14, 2017 at 5:31 PM, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> Currently, Spark shows different behavior when we uses CHAR types.
>>
>> spark-sql> CREATE TABLE t1(a CHAR(3));
>> spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
>> spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;
>>
>> spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
>> spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
>> spark-sql> INSERT INTO TABLE t3 SELECT 'a ';
>>
>> spark-sql> SELECT a, length(a) FROM t1;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t2;
>> a   3
>> spark-sql> SELECT a, length(a) FROM t3;
>> a 2
>>
>> The reason I'm asking here is that it's a little bit old default behavior
>> of Spark `STORED AS PARQUET` in Spark. (Spark 1.6.3, too.)
>>
>> For me, `CREATE TABLE t1(a CHAR(3))` shows the correct one in Spark, but
>> Parquet has been de-factor standard in Spark also. (I'm not comparing this
>> with the other DBMS.)
>>
>> I'm wondering which way we need to go or want to go in Spark?
>>
>> Bests,
>> Dongjoon.
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-14 Thread Dongjoon Hyun
Hi, Holden.

It's not a problem, but the link of `List of JIRA ... with this filter`
seems to be wrong.

Bests,
Dongjoon.


On Thu, Sep 14, 2017 at 10:47 AM, Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.2. The vote is open until Friday September 22nd at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc1
>  (6f470323a036365
> 6999dd36cb33f528afe627c12)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1248/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks, in the Java/Scala you can
> add the staging repository to your projects resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with a out of date RC going forward).
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1. That being said if
> there is something which is a regression form 2.1.1 that has not been
> correctly targeted please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
> 
> )
>
> *What are the unresolved* issues targeted for 2.1.2
> 
> ?
>
> At the time of the writing, there is one in progress major issue
> SPARK-21985 , I
> believe Andrew Ray & HyukjinKwon are looking into this one.
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-14 Thread Dongjoon Hyun
Yea. I think I found the root cause.

The correct one is the following as Sean said.


https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2

The current RC vote email has the following.

  List of JIRA tickets resolved in this release can be found with this
filter.
  <https://issues.apache.org/jira/browse/*SPARK-20134*
?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>

You can see the link in the Apache archive.

https://lists.apache.org/thread.html/52952da40593624737533cdcfad695977612a298ce26205af3f2306c@%3Cdev.spark.apache.org%3E

SPARK-20134 is 2.1.1, so I was confused.

Thanks,
Dongjoon.


On Thu, Sep 14, 2017 at 12:18 PM, Sean Owen <so...@cloudera.com> wrote:

> I think the search filter is OK, but for whatever reason the filter link
> includes what JIRA you're currently browsing, and that one is not actually
> included in the filter. It opens on a JIRA that's not included, but the
> search results look correct.   project = SPARK AND fixVersion = 2.1.2
>
> On Thu, Sep 14, 2017 at 9:15 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, Holden.
>>
>> It's not a problem, but the link of `List of JIRA ... with this filter`
>> seems to be wrong.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Sep 14, 2017 at 10:47 AM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Friday September 22nd at 18:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc1
>>> <https://github.com/apache/spark/tree/v2.1.2-rc1> (6f470323a036365
>>> 6999dd36cb33f528afe627c12)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1248/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~pwendell/spark-releases/spark-2.1.2-rc1-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala you
>>> can add the staging repository to your projects resolvers and test with the
>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>> up building with a out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said if
>>> there is something which is a regression form 2.1.1 that has not been
>>> correctly targeted please ping a committer to help target the issue (you
>>> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20%3D%20OPEN%20AND%20(affectedVersion%20%3D%202.1.2%20OR%20affectedVersion%20%3D%202.1.1)>
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> <https://issues.apache.org/jira/browse/SPARK-21985?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.2>
>>> ?
>>>
>>> At the time of the writing, there is one in progress major issue
>>> SPARK-21985 <https://issues.apache.org/jira/browse/SPARK-21985>, I
>>> believe Andrew Ray & Hyukjin Kwon are looking into this one.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>


CHAR implementation?

2017-09-14 Thread Dongjoon Hyun
Hi, All.

Currently, Spark shows different behavior when we use CHAR types.

spark-sql> CREATE TABLE t1(a CHAR(3));
spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;

spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
spark-sql> INSERT INTO TABLE t3 SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t1;
a   3
spark-sql> SELECT a, length(a) FROM t2;
a   3
spark-sql> SELECT a, length(a) FROM t3;
a 2

The reason I'm asking here is that this has been the default behavior of
`STORED AS PARQUET` in Spark for quite a while. (Spark 1.6.3 behaves the same way.)

For me, `CREATE TABLE t1(a CHAR(3))` shows the correct result in Spark, but
Parquet has been the de facto standard format in Spark as well. (I'm not
comparing this with other DBMSs.)

I'm wondering which way we need or want to go in Spark.

Bests,
Dongjoon.


Re: Disabling Closed -> Reopened transition for non-committers

2017-10-04 Thread Dongjoon Hyun
It can stop reopening, but new JIRA issues with duplicate content will be
created intentionally instead.

Is that policy (privileged reopening) used in other Apache communities for
that purpose?


On Wed, Oct 4, 2017 at 7:06 PM, Sean Owen  wrote:

> We have this problem occasionally, where a disgruntled user continually
> reopens an issue after it's closed.
>
> https://issues.apache.org/jira/browse/SPARK-21999
>
> (Feel free to comment on this one if anyone disagrees)
>
> Regardless of that particular JIRA, I'd like to disable the Closed ->
> Reopened transition for non-committers: https://issues.apache.org/
> jira/browse/INFRA-15221
>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Dongjoon Hyun
+1 (non-binding).

On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博  wrote:

> +1
>
>
> Reynold Xin 于2017年9月7日 周四下午12:04写道:
>
>> +1 as well
>>
>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust 
>> wrote:
>>
>>> +1
>>>
>>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue 
>>> wrote:
>>>
 +1 (non-binding)

 Thanks for making the updates reflected in the current PR. It would be
 great to see the doc updated before it is finally published though.

 Right now it feels like this SPIP is focused more on getting the basics
 right for what many datasources are already doing in API V1 combined with
 other private APIs, vs pushing forward state of the art for performance.

 I think that’s the right approach for this SPIP. We can add the support
 you’re talking about later with a more specific plan that doesn’t block
 fixing the problems that this addresses.
 ​

 On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
 hvanhov...@databricks.com> wrote:

> +1 (binding)
>
> I personally believe that there is quite a big difference between
> having a generic data source interface with a low surface area and pushing
> down a significant part of query processing into a datasource. The latter
> has a much wider surface area and will require us to stabilize most of
> the internal catalyst API's which will be a significant burden on the
> community to maintain and has the potential to slow development velocity
> significantly. If you want to write such integrations then you should be
> prepared to work with catalyst internals and own up to the fact that 
> things
> might change across minor versions (and in some cases even maintenance
> releases). If you are willing to go down that road, then your best bet is
> to use the already existing spark session extensions which will allow you
> to write such integrations and can be used as an `escape hatch`.
>
>
> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash 
> wrote:
>
>> +0 (non-binding)
>>
>> I think there are benefits to unifying all the Spark-internal
>> datasources into a common public API for sure.  It will serve as a 
>> forcing
>> function to ensure that those internal datasources aren't advantaged vs
>> datasources developed externally as plugins to Spark, and that all Spark
>> features are available to all datasources.
>>
>> But I also think this read-path proposal avoids the more difficult
>> questions around how to continue pushing datasource performance forwards.
>> James Baker (my colleague) had a number of questions about advanced
>> pushdowns (combined sorting and filtering), and Reynold also noted that
>> pushdown of aggregates and joins are desirable on longer timeframes as
>> well.  The Spark community saw similar requests, for aggregate pushdown 
>> in
>> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>> in SPARK-12449.  Clearly a number of people are interested in this kind 
>> of
>> performance work for datasources.
>>
>> To leave enough space for datasource developers to continue
>> experimenting with advanced interactions between Spark and their
>> datasources, I'd propose we leave some sort of escape valve that enables
>> these datasources to keep pushing the boundaries without forking Spark.
>> Possibly that looks like an additional unsupported/unstable interface 
>> that
>> pushes down an entire (unstable API) logical plan, which is expected to
>> break API on every release.   (Spark attempts this full-plan pushdown, 
>> and
>> if that fails Spark ignores it and continues on with the rest of the V2 
>> API
>> for compatibility).  Or maybe it looks like something else that we don't
>> know of yet.  Possibly this falls outside of the desired goals for the V2
>> API and instead should be a separate SPIP.
>>
>> If we had a plan for this kind of escape valve for advanced
>> datasource developers I'd be an unequivocal +1.  Right now it feels like
>> this SPIP is focused more on getting the basics right for what many
>> datasources are already doing in API V1 combined with other private APIs,
>> vs pushing forward state of the art for performance.
>>
>> Andrew
>>
>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>>>
>>> Hi all,
>>>
>>> In the previous discussion, we decided to split the read and write
>>> path of data source v2 into 2 SPIPs, and I'm sending this email to call 
>>> a
>>> vote for Data Source V2 

Re: 2.1.2 maintenance release?

2017-09-07 Thread Dongjoon Hyun
+1!

As of today,

For 2.1.2, we have 87 commits. (2.1.1 was released 4 months ago)
For 2.2.1, we have 95 commits. (2.2.0 was released 2 months ago)

Can we have 2.2.1, too?

Bests,
Dongjoon.


On Thu, Sep 7, 2017 at 2:14 AM, Sean Owen  wrote:

> In a separate conversation about bugs and a security issue fixed in 2.1.x
> and 2.0.x, Marcelo suggested it could be time for a maintenance release.
> I'm not sure what our stance on 2.0.x is, but 2.1.2 seems like it could be
> valuable to release.
>
> Thoughts? I believe Holden had expressed interest in even managing the
> release process, but maybe others are interested as well. That is, this
> could also be a chance to share that burden and spread release experience
> around a bit.
>
> Sean
>


Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Dongjoon Hyun
+1 (non-binding)

Bests,
Dongjoon.


On Wed, Sep 27, 2017 at 7:54 AM, Denny Lee  wrote:

> +1 (non-binding)
>
>
> On Wed, Sep 27, 2017 at 6:54 AM Sean Owen  wrote:
>
>> +1
>>
>> I tested the source release.
>> Hashes and signature (your signature) check out, project builds and tests
>> pass with -Phadoop-2.7 -Pyarn -Phive -Pmesos on Debian 9.
>> List of issues look good and there are no open issues at all for 2.1.2.
>>
>> Great work on improving the build process and docs.
>>
>>
>> On Wed, Sep 27, 2017 at 5:47 AM Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Wednesday October 4th at 23:59
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc2
>>>  (fabbb7f59e47590
>>> 114366d14e15fbbff8c88593c)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc2-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1251
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc2-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your project's resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with an out-of-date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said
>>> if there is something which is a regression from 2.1.1 that has not
>>> been correctly targeted please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in a while not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress), however
>>> the chances of a mistake are higher with any change like this. If there is
>>> something you normally take for granted as correct when checking a release,
>>> please double check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *Why the longer voting window?*
>>>
>>> Since there is a large industry big data conference this week I figured
>>> I'd add a little bit of extra buffer time just to make sure everyone has a
>>> chance to take a look.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Dongjoon Hyun
+1 (non-binding)

Dongjoon.

On Tue, Oct 3, 2017 at 5:13 AM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> +1
>
> On Tue, Oct 3, 2017 at 1:32 PM, Sean Owen  wrote:
>
>> +1 same as last RC. Tests pass, sigs and hashes are OK.
>>
>> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc4
>>>  (2abaea9e40fce81
>>> cd4626498e0f5c28a70917499)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your project's resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with an out-of-date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said
>>> if there is something which is a regression from 2.1.1 that has not
>>> been correctly targeted please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in a while not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress), however
>>> the chances of a mistake are higher with any change like this. If there is
>>> something you normally take for granted as correct when checking a release,
>>> please double check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *What happened to RC3?*
>>>
>>> Some R+zinc interactions kept it from getting out the door.
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>


Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Dongjoon Hyun
Great! Congratulation, Jerry! :D

Bests,
Dongjoon.

On Mon, Aug 28, 2017 at 6:30 PM, Ted Yu  wrote:

> Congratulations, Jerry !
>
> On Mon, Aug 28, 2017 at 6:28 PM, Matei Zaharia 
> wrote:
>
>> Hi everyone,
>>
>> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai
>> has been contributing to many areas of the project for a long time, so it’s
>> great to see him join. Join me in thanking and congratulating him!
>>
>> Matei
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Increase Timeout or optimize Spark UT?

2017-08-25 Thread Dongjoon Hyun
BTW, the situation seems to be getting worse; now we have lost two builds.

-
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/
-
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/

Dongjoon.




On Thu, Aug 24, 2017 at 2:27 AM, Maciej Szymkiewicz 
wrote:

> It won't be used by PySpark and SparkR, will it?
>
> On 23 August 2017 at 23:40, Michael Armbrust 
> wrote:
>
>> I think we already set the number of partitions to 5 in tests
>> 
>> ?
>>
>> On Tue, Aug 22, 2017 at 3:25 PM, Maciej Szymkiewicz <
>> mszymkiew...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> From my experience it is possible to cut quite a lot by reducing
>>> spark.sql.shuffle.partitions to some reasonable value (let's say
>>> comparable to the number of cores). 200 is a serious overkill for most of
>>> the test cases anyway.
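>>>
>>> As a rough sketch (the exact values below are only an illustration), a shared
>>> test SparkSession could be built like this:
>>>
>>>   // sketch only: keep shuffle partitions close to the number of local cores
>>>   import org.apache.spark.sql.SparkSession
>>>   val spark = SparkSession.builder()
>>>     .master("local[4]")
>>>     .config("spark.sql.shuffle.partitions", "4")  // instead of the default 200
>>>     .getOrCreate()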
>>>
>>>
>>> Best,
>>> Maciej
>>>
>>>
>>>
>>> On 21 August 2017 at 03:00, Dong Joon Hyun 
>>> wrote:
>>>
 +1 for any efforts to recover Jenkins!



 Thank you for the direction.



 Bests,

 Dongjoon.



 *From: *Reynold Xin 
 *Date: *Sunday, August 20, 2017 at 5:53 PM
 *To: *Dong Joon Hyun 
 *Cc: *"dev@spark.apache.org" 
 *Subject: *Re: Increase Timeout or optimize Spark UT?



 It seems like it's time to look into how to cut down some of the test
 runtimes. Test runtimes will slowly go up given the way development
 happens. 3 hr is already a very long time for tests to run.





 On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun 
 wrote:

 Hi, All.



 Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6)
 has been hitting the build timeout.



 Please see the build time trend.



 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
 t%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend



 All of the last 22 builds failed due to timeouts, directly or indirectly. The last
 successful build (SBT with Hadoop-2.7) was on August 15th.



 We may do the followings.



1. Increase Build Timeout (3 hr 30 min)
2. Optimize UTs (Scala/Java/Python/UT)



 But Option 1 will be the immediate solution for now. Could you update
 the Jenkins setup?



 Bests,

 Dongjoon.



>>>
>>>
>>
>
>
> --
>
> Z poważaniem,
> Maciej Szymkiewicz
>
>


Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-28 Thread Dongjoon Hyun
+1 (non-binding)

RC2 is tested on CentOS, too.

Bests,
Dongjoon.

On Tue, Nov 28, 2017 at 4:35 PM, Hyukjin Kwon  wrote:

> +1
>
> 2017-11-29 8:18 GMT+09:00 Henry Robinson :
>
>> (My vote is non-binding, of course).
>>
>> On 28 November 2017 at 14:53, Henry Robinson  wrote:
>>
>>> +1, tests all pass for me on Ubuntu 16.04.
>>>
>>> On 28 November 2017 at 10:36, Herman van Hövell tot Westerflier <
>>> hvanhov...@databricks.com> wrote:
>>>
 +1

 On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung 
 wrote:

> +1
>
> Thanks Sean. Please vote!
>
> Tested various scenarios with R package. Ubuntu, Debian, Windows
> r-devel and release and on r-hub. Verified CRAN checks are clean
> (only 1 NOTE!) and no leaked files (.cache removed, /tmp clean)
>
>
> On Sun, Nov 26, 2017 at 11:55 AM Sean Owen  wrote:
>
>> Yes it downloads recent releases. The test worked for me on a second
>> try, so I suspect a bad mirror. If this comes up frequently we can just 
>> add
>> retry logic, as the closer.lua script will return different mirrors each
>> time.
>>
>> The tests all pass for me on the latest Debian, so +1 for this
>> release.
>>
>> (I committed the change to set -Xss4m for tests consistently, but
>> this shouldn't block a release.)
>>
>>
>> On Sat, Nov 25, 2017 at 12:47 PM Felix Cheung 
>> wrote:
>>
>>> Ah sorry digging through the history it looks like this is changed
>>> relatively recently and should only download previous releases.
>>>
>>> Perhaps we are intermittently hitting a mirror that doesn’t have the
>>> files?
>>>
>>>
>>> https://github.com/apache/spark/commit/daa838b8886496e64700b
>>> 55d1301d348f1d5c9ae
>>>
>>>
>>> On Sat, Nov 25, 2017 at 10:36 AM Felix Cheung <
>>> felixche...@apache.org> wrote:
>>>
 Thanks Sean.

 For the second one, it looks like the  
 HiveExternalCatalogVersionsSuite is
 trying to download the release tgz from the official Apache mirror, 
 which
 won’t work unless the release is actually, released?

 val preferredMirror =
 Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true",
 "-q", "-O", "-").!!.trim
 val url = s"$preferredMirror/spark/spark
 -$version/spark-$version-bin-hadoop2.7.tgz"

 It’s proabbly getting an error page instead.


 On Sat, Nov 25, 2017 at 10:28 AM Sean Owen 
 wrote:

> I hit the same StackOverflowError as in the previous RC test, but,
> pretty sure this is just because the increased thread stack size JVM 
> flag
> isn't applied consistently. This seems to resolve it:
>
> https://github.com/apache/spark/pull/19820
>
> This wouldn't block release IMHO.
>
>
> I am currently investigating this failure though -- seems like the
> mechanism that downloads Spark tarballs needs fixing, or updating, in 
> the
> 2.2 branch?
>
> HiveExternalCatalogVersionsSuite:
>
> gzip: stdin: not in gzip format
>
> tar: Child returned status 1
>
> tar: Error is not recoverable: exiting now
>
> *** RUN ABORTED ***
>
>   java.io.IOException: Cannot run program "./bin/spark-submit"
> (in directory "/tmp/test-spark/spark-2.0.2"): error=2, No such
> file or directory
>
> On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung <
> felixche...@apache.org> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.2.1. The vote is open until Friday December 1, 2017 at
>> 8:00:00 am UTC and passes if a majority of at least 3 PMC +1
>> votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.2.1
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see
>> https://spark.apache.org/
>>
>>
>> The tag to be voted on is v2.2.1-rc2 https://github.com/
>> apache/spark/tree/v2.2.1-rc2  (e30e2698a2193f0bbdcd4edb88471
>> 0819ab6397c)
>>
>> List of JIRA tickets resolved in this release can be found here
>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>
>>
>> The release files, including signatures, digests, etc. can be
>> found at:
>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/
>>
>> Release artifacts are signed with the following 

Re: SPARK-22267 issue: Spark SQL incorrectly reads ORC file when column order is different

2017-11-14 Thread Dongjoon Hyun
Hi, Mark.

That is one of the reasons why I left it behind from the previous PR
(below), and I'm focusing on the second approach: use OrcFileFormat with
convertMetastoreOrc.

https://github.com/apache/spark/pull/19470
[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC
table instead of ORC file schema

With `convertMetastoreOrc=true`, Spark 2.3 will become more stable and faster.
Also, it's the default Spark way to handle Parquet.
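
As a rough spark-shell sketch of that second approach (the table name below is
made up):

  // sketch only: with convertMetastoreOrc=true, the Hive ORC table is read through
  // Spark's OrcFileFormat instead of the Hive SerDe reader
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .enableHiveSupport()
    .config("spark.sql.hive.convertMetastoreOrc", "true")
    .getOrCreate()

  spark.sql("SELECT * FROM my_orc_table").show()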

BTW, thank you for looking at SPARK-22267. So far, I haven't been looking into
that issue myself.
If we have a fix for SPARK-22267 in Spark 2.3, it would be great!

Bests,
Dongjoon.


On Tue, Nov 14, 2017 at 3:46 AM, Mark Petruska 
wrote:

>   Hi,
> I'm very new to spark development, and would like to get guidance from
> more experienced members.
> Sorry this email will be long as I try to explain the details.
>
> Started to investigate the issue SPARK-22267
> ; added some test
> cases to highlight the problem in the PR
> . Here are my findings:
>
> - for parquet the test case succeeds as expected
>
> - the sql test case for orc:
> - when CONVERT_METASTORE_ORC is set to "true" the data fields are
> presented in the desired order
> - when it is "false" the columns are read in the wrong order
> - Reason: when `isConvertible` returns true in `RelationConversions`
> the plan executes `convertToLogicalRelation`, which in turn uses
> `OrcFileFormat` to read the data; otherwise it uses the classes in
> "hive-exec:1.2.1".
>
> - the HadoopRDD test case was added to further investigate the parameter
> values to discover a working combination, but unfortunately no combination
> of "serialization.ddl" and "columns" result in success. It seems that those
> fields do not have any effect on the order of the resulting data fields.
>
>
> At this point I do not see any option to fix this issue without risking
> "backward compatibility" problems.
> The possible actions (as I see them):
> - link a new version of "hive-exec": surely this bug has been fixed in a
> newer version
> - use `OrcFileFormat` for reading orc data regardless of the setting of
> CONVERT_METASTORE_ORC
> - also there's an `OrcNewInputFormat` class in "hive-exec", but it
> implements an InputFormat interface from a different package, hence it is
> incompatible with HadoopRDD at the moment
>
> Please help me. Did I miss some viable options?
>
> Thanks,
> Mark
>
>


Re: Cutting the RC for Spark 2.2.1 release

2017-11-08 Thread Dongjoon Hyun
It's great, Felix!

As of today, `branch-2.2` seems to be broken due to SPARK-22211 (Scala UT
failure) and SPARK-22417 (Python UT failure).
I pinged you on both.

Bests,
Dongjoon.


On Wed, Nov 8, 2017 at 5:51 PM, Holden Karau  wrote:

> Thanks for stepping up and running the 2.2.1 release :)
>
> On Wed, Nov 8, 2017 at 3:57 PM Felix Cheung 
> wrote:
>
>> Hi!
>>
>> As we are closing down on the few known issues I think we are ready to
>> tag and cut the 2.2.1 release.
>>
>> If you are aware of any issue that you think should go into this release
>> please feel free to ping me and mark the JIRA as targeting 2.2.1. I will be
>> scrubbing JIRA in the next few days.
>>
>> So unless we hear otherwise, I’m going to tag and build the RC starting
>> Saturday EOD (PT). Please be patient since I’m going to be new at this :)
>> but will keep the dev@ posted for any update.
>>
>> Yours
>> RM for 2.2.1
>>
>>
>> --
> Twitter: https://twitter.com/holdenkarau
>


Apache Spark 2.3 and Apache ORC 1.4 finally

2017-12-05 Thread Dongjoon Hyun
Hi, All.

Today, Apache Spark starts to use Apache ORC 1.4 as a `native` ORC
implementation.

SPARK-20728 Make OrcFileFormat configurable between `sql/hive` and
`sql/core`.
-
https://github.com/apache/spark/commit/326f1d6728a7734c228d8bfaa69442a1c7b92e9b

Thank you so much for all your support for this!

I'll proceed with more ORC issues in order to build synergy between both
communities.

Please see https://issues.apache.org/jira/browse/SPARK-20901 for the
updates.

Bests,
Dongjoon.


Re: Running lint-java during PR builds?

2018-05-21 Thread Dongjoon Hyun
Thank you for reconsidering this, Hyukjin. :)

Bests,
Dongjoon.


On Mon, May 21, 2018 at 9:20 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Is there a way to trigger it conditionally? e.g. only if the diff
> touches java files.
>
> On Mon, May 21, 2018 at 9:17 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
> > One concern is with the volume of test runs on Travis.
> >
> > In ASF projects Travis could get significantly
> > backed up since - if I recall - all of ASF shares one queue.
> >
> > At the number of PRs Spark has this could be a big issue.
> >
> >
> > 
> > From: Marcelo Vanzin <van...@cloudera.com>
> > Sent: Monday, May 21, 2018 9:08:28 AM
> > To: Hyukjin Kwon
> > Cc: Dongjoon Hyun; dev
> > Subject: Re: Running lint-java during PR builds?
> >
> > I'm fine with it. I tried to use the existing checkstyle sbt plugin
> > (trying to fix SPARK-22269), but it depends on an ancient version of
> > checkstyle, and I don't know sbt enough to figure out how to hack
> > classpaths and class loaders when applying rules, so gave up.
> >
> > On Mon, May 21, 2018 at 1:47 AM, Hyukjin Kwon <gurwls...@gmail.com>
> wrote:
> >> I am going to open an INFRA JIRA if there's no explicit objection in few
> >> days.
> >>
> >> 2018-05-21 13:09 GMT+08:00 Hyukjin Kwon <gurwls...@gmail.com>:
> >>>
> >>> I would like to revive this proposal: Travis CI. Shall we give this a
> >>> try? I think it's worth trying it.
> >>>
> >>> 2016-11-17 3:50 GMT+08:00 Dongjoon Hyun <dongj...@apache.org>:
> >>>>
> >>>> Hi, Marcelo and Ryan.
> >>>>
> >>>> That was the main purpose of my proposal about Travis.CI.
> >>>> IMO, that is the only way to achieve that without any harmful
> >>>> side-effect
> >>>> on Jenkins infra.
> >>>>
> >>>> Spark is already ready for that. Like AppVoyer, if one of you files an
> >>>> INFRA jira issue to enable that, they will turn on that. Then, we can
> >>>> try it
> >>>> and see the result. Also, you can turn off easily again if you don't
> >>>> want.
> >>>>
> >>>> Without this, we will consume more community efforts. For example, we
> >>>> merged lint-java error fix PR seven hours ago, but the master branch
> >>>> still
> >>>> has one lint-java error.
> >>>>
> >>>> https://travis-ci.org/dongjoon-hyun/spark/jobs/176351319
> >>>>
> >>>> Actually, I've been monitoring the history here. (It's synced every 30
> >>>> minutes.)
> >>>>
> >>>> https://travis-ci.org/dongjoon-hyun/spark/builds
> >>>>
> >>>> Could we give a change to this?
> >>>>
> >>>> Bests,
> >>>> Dongjoon.
> >>>>
> >>>> On 2016-11-15 13:40 (-0800), "Shixiong(Ryan) Zhu"
> >>>> <shixi...@databricks.com> wrote:
> >>>> > I remember it's because you need to run `mvn install` before running
> >>>> > lint-java if the maven cache is empty, and `mvn install` is pretty
> >>>> > heavy.
> >>>> >
> >>>> > On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin <
> van...@cloudera.com>
> >>>> > wrote:
> >>>> >
> >>>> > > Hey all,
> >>>> > >
> >>>> > > Is there a reason why lint-java is not run during PR builds? I see
> >>>> > > it
> >>>> > > seems to be maven-only, is it really expensive to run after an sbt
> >>>> > > build?
> >>>> > >
> >>>> > > I see a lot of PRs coming in to fix Java style issues, and those
> all
> >>>> > > seem a little unnecessary. Either we're enforcing style checks or
> >>>> > > we're not, and right now it seems we aren't.
> >>>> > >
> >>>> > > --
> >>>> > > Marcelo
> >>>> > >
> >>>> > >
> >>>> > > 
> -
> >>>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>> > >
> >>>> > >
> >>>> >
> >>>>
> >>>> -
> >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>>
> >>>
> >>
> >
> >
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
>
> --
> Marcelo
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-03 Thread Dongjoon Hyun
+1

Bests,
Dongjoon.

On Sat, Jun 2, 2018 at 8:09 PM, Denny Lee  wrote:

> +1
>
> On Sat, Jun 2, 2018 at 4:53 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I'll give that a try, but I'll still have to figure out what to do if
>> none of the release builds work with hadoop-aws, since Flintrock deploys
>> Spark release builds to set up a cluster. Building Spark is slow, so we
>> only do it if the user specifically requests a Spark version by git hash.
>> (This is basically how spark-ec2 did things, too.)
>>
>>
>> On Sat, Jun 2, 2018 at 6:54 PM Marcelo Vanzin 
>> wrote:
>>
>>> If you're building your own Spark, definitely try the hadoop-cloud
>>> profile. Then you don't even need to pull anything at runtime,
>>> everything is already packaged with Spark.
>>>
>>> On Fri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas
>>>  wrote:
>>> > pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for
>>> me
>>> > either (even building with -Phadoop-2.7). I guess I’ve been relying on
>>> an
>>> > unsupported pattern and will need to figure something else out going
>>> forward
>>> > in order to use s3a://.
>>> >
>>> >
>>> > On Fri, Jun 1, 2018 at 9:09 PM Marcelo Vanzin 
>>> wrote:
>>> >>
>>> >> I have personally never tried to include hadoop-aws that way. But at
>>> >> the very least, I'd try to use the same version of Hadoop as the Spark
>>> >> build (2.7.3 IIRC). I don't really expect a different version to work,
>>> >> and if it did in the past it definitely was not by design.
>>> >>
>>> >> On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas
>>> >>  wrote:
>>> >> > Building with -Phadoop-2.7 didn’t help, and if I remember correctly,
>>> >> > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0
>>> release,
>>> >> > so
>>> >> > it appears something has changed since then.
>>> >> >
>>> >> > I wasn’t familiar with -Phadoop-cloud, but I can try that.
>>> >> >
>>> >> > My goal here is simply to confirm that this release of Spark works
>>> with
>>> >> > hadoop-aws like past releases did, particularly for Flintrock users
>>> who
>>> >> > use
>>> >> > Spark with S3A.
>>> >> >
>>> >> > We currently provide -hadoop2.6, -hadoop2.7, and -without-hadoop
>>> builds
>>> >> > with
>>> >> > every Spark release. If the -hadoop2.7 release build won’t work with
>>> >> > hadoop-aws anymore, are there plans to provide a new build type that
>>> >> > will?
>>> >> >
>>> >> > Apologies if the question is poorly formed. I’m batting a bit
>>> outside my
>>> >> > league here. Again, my goal is simply to confirm that I/my users
>>> still
>>> >> > have
>>> >> > a way to use s3a://. In the past, that way was simply to call
>>> pyspark
>>> >> > --packages org.apache.hadoop:hadoop-aws:2.8.4 or something very
>>> similar.
>>> >> > If
>>> >> > that will no longer work, I’m trying to confirm that the change of
>>> >> > behavior
>>> >> > is intentional or acceptable (as a review for the Spark project) and
>>> >> > figure
>>> >> > out what I need to change (as due diligence for Flintrock’s users).
>>> >> >
>>> >> > Nick
>>> >> >
>>> >> >
>>> >> > On Fri, Jun 1, 2018 at 8:21 PM Marcelo Vanzin 
>>> >> > wrote:
>>> >> >>
>>> >> >> Using the hadoop-aws package is probably going to be a little more
>>> >> >> complicated than that. The best bet is to use a custom build of
>>> Spark
>>> >> >> that includes it (use -Phadoop-cloud). Otherwise you're probably
>>> >> >> looking at some nasty dependency issues, especially if you end up
>>> >> >> mixing different versions of Hadoop.
>>> >> >>
>>> >> >> On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas
>>> >> >>  wrote:
>>> >> >> > I was able to successfully launch a Spark cluster on EC2 at
>>> 2.3.1 RC4
>>> >> >> > using
>>> >> >> > Flintrock. However, trying to load the hadoop-aws package gave me
>>> >> >> > some
>>> >> >> > errors.
>>> >> >> >
>>> >> >> > $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4
>>> >> >> >
>>> >> >> > 
>>> >> >> >
>>> >> >> > :: problems summary ::
>>> >> >> >  WARNINGS
>>> >> >> > [NOT FOUND  ]
>>> >> >> > com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle) (2ms)
>>> >> >> >  local-m2-cache: tried
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > file:/home/ec2-user/.m2/repository/com/sun/jersey/
>>> jersey-json/1.9/jersey-json-1.9.jar
>>> >> >> > [NOT FOUND  ]
>>> >> >> > com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
>>> >> >> >  local-m2-cache: tried
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > file:/home/ec2-user/.m2/repository/com/sun/jersey/
>>> jersey-server/1.9/jersey-server-1.9.jar
>>> >> >> > [NOT FOUND  ]
>>> >> >> > org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
>>> >> >> >  local-m2-cache: tried
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > file:/home/ec2-user/.m2/repository/org/codehaus/
>>> jettison/jettison/1.1/jettison-1.1.jar
>>> >> >> > [NOT FOUND  ]
>>> >> >> > 

Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-10 Thread Dongjoon Hyun
Hi, All.

Vectorized ORC Reader is now supported in Apache Spark 2.3.

https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now on, Spark can read ORC files faster
without a feature penalty.

Thank you for all your support, especially Wenchen Fan.

It's done by two commits.

[SPARK-16060][SQL] Support Vectorized ORC Reader
https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a
9d6f2bf766

[SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
reader
https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964c
faa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in
ORC` to `Native ORC Vectorized`.

https://github.com/apache/spark/blob/master/sql/hive/
src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

Thank you.

Bests,
Dongjoon.


Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Dongjoon Hyun
SPARK-23221 is one of the reasons for the Kafka test-suite deadlock issue.

As for the hang issues, they do not seem to be marked as failures correctly in
the Apache Spark Jenkins history.


On Thu, Jan 25, 2018 at 1:03 PM, Marcelo Vanzin  wrote:

> On Thu, Jan 25, 2018 at 12:29 PM, Sean Owen  wrote:
> > I am still seeing these tests fail or hang:
> >
> > - subscribing topic by name from earliest offsets (failOnDataLoss: false)
> > - subscribing topic by name from earliest offsets (failOnDataLoss: true)
>
> This is something that we are seeing internally on a different version
> Spark, and we're currently investigating with our Kafka people. Not
> sure it's the same issue (we have a newer version of Kafka libraries),
> but this is just another way of saying that I don't think those hangs
> are new in 2.3, at least.
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Schema Evolution in Apache Spark

2018-01-12 Thread Dongjoon Hyun
This is about Spark-layer test cases on **read-only** CSV, JSON, Parquet, and
ORC files. You can find more details and comparisons of Spark's support
coverage there.

Bests,
Dongjoon.


On Thu, Jan 11, 2018 at 22:19 Georg Heiler <georg.kf.hei...@gmail.com>
wrote:

> Isn't this related to the data format used, i.e. parquet, Avro, ... which
> already support changing schema?
>
> Dongjoon Hyun <dongjoon.h...@gmail.com> schrieb am Fr., 12. Jan. 2018 um
> 02:30 Uhr:
>
>> Hi, All.
>>
>> A data schema can evolve in several ways and Apache Spark 2.3 already
>> supports the following for file-based data sources like
>> CSV/JSON/ORC/Parquet.
>>
>> 1. Add a column
>> 2. Remove a column
>> 3. Change a column position
>> 4. Change a column type
>>
>> Can we guarantee users some schema evolution coverage on file-based data
>> sources by adding schema evolution test suites explicitly? So far, there
>> are some test cases.
>>
>> For simplicity, I have several assumptions on schema evolution.
>>
>> 1. A safe evolution without data loss.
>> - e.g. from small types to larger types like int-to-long, not vice
>> versa.
>> 2. Final schema is given by users (or Hive)
>> 3. Simple Spark data types supported by Spark vectorized execution.
>>
>> I made a test case PR to receive your opinions for this.
>>
>> [SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
>> data sources
>> - https://github.com/apache/spark/pull/20208
>>
>> Could you take a look and give some opinions?
>>
>> Bests,
>> Dongjoon.
>>
>


Build timed out for `branch-2.3 (hadoop-2.7)`

2018-01-11 Thread Dongjoon Hyun
Hi, All and Shane.

Can we increase the build time for `branch-2.3` during the 2.3 RC period?

There are two known test issues, but the Jenkins on branch-2.3 with
hadoop-2.7 fails with a build timeout. So, it's difficult to monitor whether
the branch is healthy or not.

Build timed out (after 255 minutes). Marking the build as aborted.
Build was aborted
...
Finished: ABORTED

-
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/60/console
-
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/47/console

Bests,
Dongjoon.


Schema Evolution in Apache Spark

2018-01-11 Thread Dongjoon Hyun
Hi, All.

A data schema can evolve in several ways and Apache Spark 2.3 already
supports the following for file-based data sources like
CSV/JSON/ORC/Parquet.

1. Add a column
2. Remove a column
3. Change a column position
4. Change a column type

Can we guarantee users some schema evolution coverage on file-based data
sources by adding schema evolution test suites explicitly? So far, there
are some test cases.

For simplicity, I have several assumptions on schema evolution.

1. A safe evolution without data loss.
- e.g. from small types to larger types like int-to-long, not vice
versa.
2. Final schema is given by users (or Hive)
3. Simple Spark data types supported by Spark vectorized execution.
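
For example, a rough spark-shell sketch of the "add a column" case with a
user-given final schema (the path and column names below are made up):

  // sketch only: old files contain only column `a`; the user-supplied final schema
  // also has `b`, which is expected to come back as null for the old files
  import org.apache.spark.sql.types._

  spark.range(3).selectExpr("id AS a")
    .write.mode("overwrite").parquet("/tmp/evolution_demo")

  val evolved = new StructType().add("a", LongType).add("b", StringType)
  spark.read.schema(evolved).parquet("/tmp/evolution_demo").show()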

I made a test case PR to receive your opinions for this.

[SPARK-23007][SQL][TEST] Add schema evolution test suite for file-based
data sources
- https://github.com/apache/spark/pull/20208

Could you take a look and give some opinions?

Bests,
Dongjoon.


Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Dongjoon Hyun
Hi, Nicolas.

Yes. In Apache Spark 2.3, there are new sub-improvements for SPARK-20901
(Feature parity for ORC with Parquet).
For your questions, the following three are related.

1. spark.sql.orc.impl="native"
This selects the newly added `native` ORC implementation (based on the
latest ORC 1.4.1).
The old one is the `hive` implementation.

2. spark.sql.orc.enableVectorizedReader="true"
By default, `native` ORC implementation uses Vectorized Reader code
path if possible.
Please note that vectorization (Parquet/ORC) in Apache Spark is supported
only for simple data types.

3. spark.sql.hive.convertMetastoreOrc=true
Like Parquet, by default, Hive tables are converted into file-based
data sources to use the vectorization technique.
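
For the file-read case, a minimal spark-shell sketch (the path below is made up):

  // sketch only: select the native ORC implementation and its vectorized reader
  spark.conf.set("spark.sql.orc.impl", "native")
  spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
  spark.read.format("orc").load("/tmp/my_orc_data").show()

For Hive ORC tables, spark.sql.hive.convertMetastoreOrc additionally controls
whether the same file-based, vectorized path is used.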

Bests,
Dongjoon.



On Sun, Jan 28, 2018 at 4:15 AM, Nicolas Paris <nipari...@gmail.com> wrote:

> Hi
>
> Thanks for this work.
>
> Will this affect both:
> 1) spark.read.format("orc").load("...")
> 2) spark.sql("select ... from my_orc_table_in_hive")
>
> ?
>
>
> Le 10 janv. 2018 à 20:14, Dongjoon Hyun écrivait :
> > Hi, All.
> >
> > Vectorized ORC Reader is now supported in Apache Spark 2.3.
> >
> > https://issues.apache.org/jira/browse/SPARK-16060
> >
> > It has been a long journey. From now, Spark can read ORC files faster
> without
> > feature penalty.
> >
> > Thank you for all your support, especially Wenchen Fan.
> >
> > It's done by two commits.
> >
> > [SPARK-16060][SQL] Support Vectorized ORC Reader
> > https://github.com/apache/spark/commit/
> f44ba910f58083458e1133502e193a
> > 9d6f2bf766
> >
> > [SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized
> orc
> > reader
> > https://github.com/apache/spark/commit/
> eaac60a1e20e29084b7151ffca964c
> > faa5ba99d1
> >
> > Please check OrcReadBenchmark for the final speed-up from `Hive built-in
> ORC`
> > to `Native ORC Vectorized`.
> >
> > https://github.com/apache/spark/blob/master/sql/hive/
> src/test/scala/org/
> > apache/spark/sql/hive/orc/OrcReadBenchmark.scala
> >
> > Thank you.
> >
> > Bests,
> > Dongjoon.
>


`convertMetastoreOrc/Parquet` issue

2018-02-07 Thread Dongjoon Hyun
Hi, All.

SPARK-22279  turned on
`convertMetastoreOrc` by default for `Feature Parity with Parquet`.

Unfortunately, it will be turned back off via
https://github.com/apache/spark/pull/20536 in Apache Spark 2.3 RC3 because
a well-known `convertMetastoreOrc/Parquet` issue is not resolved yet.

That means reading Hive ORC tables is not accelerated by vectorization in a
vanilla configuration.

This will prevent any regression on Hive ORC tables, and users are still
able to turn it on manually.

Sorry for the last-minute notification, and thank you for your support and
decision.

Bests,
Dongjoon.


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Dongjoon Hyun
+1.

I tested RC4 on CentOS 7.4 / OpenJDK 1.8.0_161 with `-Pyarn -Phadoop-2.7
-Pkinesis-asl -Phive -Phive-thriftserver -Psparkr`.

Bests,
Dongjoon.



On Sun, Feb 18, 2018 at 3:22 PM, Denny Lee  wrote:

> +1 (non-binding)
>
> Built and tested on macOS and Ubuntu.
>
>
> On Sun, Feb 18, 2018 at 3:19 PM Ricardo Almeida <
> ricardo.alme...@actnowib.com> wrote:
>
>> +1 (non-binding)
>>
>> Built and tested on macOS 10.12.6 Java 8 (build 1.8.0_111). No
>> regressions detected so far.
>>
>>
>> On 18 February 2018 at 16:12, Sean Owen  wrote:
>>
>>> +1 from me as last time, same outcome.
>>>
>>> I saw one test fail, but passed on a second run, so just seems flaky.
>>>
>>> - subscribing topic by name from latest offsets (failOnDataLoss: true)
>>> *** FAILED ***
>>>   Error while stopping stream:
>>>   query.exception() is not empty after clean stop: org.apache.spark.sql.
>>> streaming.StreamingQueryException: Writing job failed.
>>>   === Streaming Query ===
>>>   Identifier: [id = cdd647ec-d7f0-437b-9950-ce9d79d691d1, runId =
>>> 3a7cf7ec-670a-48b6-8185-8b6cd7e27f96]
>>>   Current Committed Offsets: {KafkaSource[Subscribe[topic-4]]:
>>> {"topic-4":{"2":1,"4":1,"1":0,"3":0,"0":2}}}
>>>   Current Available Offsets: {}
>>>
>>>   Current State: TERMINATED
>>>   Thread State: RUNNABLE
>>>
>>> On Sat, Feb 17, 2018 at 3:41 PM Sameer Agarwal 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.3.0. The vote is open until Thursday February 22, 2018 at 8:00:00
 am UTC and passes if a majority of at least 3 PMC +1 votes are cast.


 [ ] +1 Release this package as Apache Spark 2.3.0

 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see https://spark.apache.org/

 The tag to be voted on is v2.3.0-rc4: https://github.com/apache/
 spark/tree/v2.3.0-rc4 (44095cb65500739695b0324c177c19dfa1471472)

 List of JIRA tickets resolved in this release can be found here:
 https://issues.apache.org/jira/projects/SPARK/versions/12339551

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1265/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-
 docs/_site/index.html


 FAQ

 ===
 What are the unresolved issues targeted for 2.3.0?
 ===

 Please see https://s.apache.org/oXKi. At the time of writing, there
 are currently no known release blockers.

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks, in the Java/Scala you
 can add the staging repository to your projects resolvers and test with the
 RC (make sure to clean up the artifact cache before/after so you don't end
 up building with a out of date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 2.3.0?
 ===

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should be
 worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
 appropriate.

 ===
 Why is my bug not fixed?
 ===

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from 2.2.0. That being
 said, if there is something which is a regression from 2.2.0 and has not
 been correctly targeted please ping me or a committer to help target the
 issue (you can see the open issues listed as impacting Spark 2.3.0 at
 https://s.apache.org/WmoI).

>>>
>>


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Dongjoon Hyun
In addition to Hyukjin's `github.io` result, `jekyll` also forwards the
search result links correctly.

SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch

And, connect `http://127.0.0.1:4000`.

This will be the same on the Apache Spark website.

Bests,
Dongjoon.



On Mon, Feb 19, 2018 at 8:37 PM, vaquar khan  wrote:

> +1
>
> Regards,
> Vaquar khan
>
> On Mon, Feb 19, 2018 at 10:29 PM, Xiao Li  wrote:
>
>> +1.
>>
>> So far, no function/performance regression in Spark SQL, Core and
>> PySpark.
>>
>> Thanks!
>>
>> Xiao
>>
>> 2018-02-19 19:47 GMT-08:00 Hyukjin Kwon :
>>
>>> Ah, I see. For 1), I overlooked Felix's input here. I couldn't foresee
>>> this when I added this documentation because it worked in my simple demo:
>>>
>>> https://spark-test.github.io/sparksqldoc/search.html?q=approx
>>> https://spark-test.github.io/sparksqldoc/#approx_percentile
>>>
>>> Will try to investigate this shortly too.
>>>
>>>
>>>
>>> 2018-02-20 11:45 GMT+09:00 Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu>:
>>>
 For (1) I think it has something to do with https://dist.apache.org/r
 epos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/ not automatically
 going to https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-d
 ocs/_site/api/sql/index.html -- So if you see the link to
 approx_percentile the link we generate is https://dist.apache.org/rep
 os/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/#approx_percentile --
 This doesn't work as Felix said but https://dist.apache.org/re
 pos/dist/dev/spark/v2.3.0-rc4-docs/_site/api/sql/index.html#
 approx_percentile works

 I'm not sure how this will behave on the main site. FWIW
 http://spark.apache.org/docs/latest/api/python/ does redirect to
 http://spark.apache.org/docs/latest/api/python/index.html

 Thanks
 Shivaram

 On Mon, Feb 19, 2018 at 6:31 PM, Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> Ah sorry I realize my wordings were unclear (not enough zzz or coffee)
>
> So to clarify,
> 1) when searching for a word in the Sql function doc, it does return
> that search result page correctly; however, none of the links in the result
> open to the actual doc page, so to take the search I included as an
> example, if you click on approx_percentile, for instance, it brings up
> the web directory instead.
>
> 2) The second is the dist location we are voting on has a .iml file,
> which is normally not included in release or release RC and it is unsigned
> and without hash (therefore seems like it should not be in the release)
>
> Thanks!
>
> _
> From: Shivaram Venkataraman 
> Sent: Tuesday, February 20, 2018 2:24 AM
> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
> To: Felix Cheung 
> Cc: Sean Owen , dev 
>
>
>
> FWIW The search result link works for me
>
> Shivaram
>
> On Mon, Feb 19, 2018 at 6:21 PM, Felix Cheung <
> felixcheun...@hotmail.com> wrote:
>
>> These are two separate things:
>>
>> Does the search result links work for you?
>>
>> The second is the dist location we are voting on has a .iml file.
>>
>> _
>> From: Sean Owen 
>> Sent: Tuesday, February 20, 2018 2:19 AM
>> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
>> To: Felix Cheung 
>> Cc: dev 
>>
>>
>>
>> Maybe I misunderstand, but I don't see any .iml file in the 4 results
>> on that page? it looks reasonable.
>>
>> On Mon, Feb 19, 2018 at 8:02 PM Felix Cheung <
>> felixcheun...@hotmail.com> wrote:
>>
>>> Any idea with sql func docs search result returning broken links as
>>> below?
>>>
>>> *From:* Felix Cheung 
>>> *Sent:* Sunday, February 18, 2018 10:05:22 AM
>>> *To:* Sameer Agarwal; Sameer Agarwal
>>>
>>> *Cc:* dev
>>> *Subject:* Re: [VOTE] Spark 2.3.0 (RC4)
>>> Quick questions:
>>>
>>> is there search link for sql functions quite right?
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs
>>> /_site/api/sql/search.html?q=app
>>>
>>> this file shouldn't be included? https://dist.apache.org/repos/
>>> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>>
>>>
>>
>>
>
>
>

>>>
>>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
> Greater Chicago
>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-26 Thread Dongjoon Hyun
+1 (non-binding).

Bests,
Dongjoon.


On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue 
wrote:

> +1 (non-binding)
>
> On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li  wrote:
>
>> +1 (binding) in Spark SQL, Core and PySpark.
>>
>> Xiao
>>
>> 2018-02-24 14:49 GMT-08:00 Ricardo Almeida 
>> :
>>
>>> +1 (non-binding)
>>>
>>> same as previous RC
>>>
>>> On 24 February 2018 at 11:10, Hyukjin Kwon  wrote:
>>>
 +1

 2018-02-24 16:57 GMT+09:00 Bryan Cutler :

> +1
> Tests passed and additionally ran Arrow related tests and did some
> perf checks with python 2.7.14
>
> On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau 
> wrote:
>
>> Note: given the state of Jenkins I'd love to see Bryan Cutler or
>> someone with Arrow experience sign off on this release.
>>
>> On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Passed all the tests, looks good.
>>>
>>> Cheng
>>>
>>> On 2/23/18 15:00, Holden Karau wrote:
>>>
>>> +1 (binding)
>>> PySpark artifacts install in a fresh Py3 virtual env
>>>
>>> On Feb 23, 2018 7:55 AM, "Denny Lee"  wrote:
>>>
 +1 (non-binding)

 On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
 joshgoldsboroughs...@gmail.com> wrote:

> New to testing out Spark RCs for the community but I was able to
> run some of the basic unit tests without error so for what it's 
> worth, I'm
> a +1.
>
> On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal <
> samee...@apache.org> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.3.0. The vote is open until Tuesday February 27, 2018 at 
>> 8:00:00
>> am UTC and passes if a majority of at least 3 PMC +1 votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.3.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see
>> https://spark.apache.org/
>>
>> The tag to be voted on is v2.3.0-rc5:
>> https://github.com/apache/spark/tree/v2.3.0-rc5
>> (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>>
>> List of JIRA tickets resolved in this release can be found here:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>
>> The release files, including signatures, digests, etc. can be
>> found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapache
>> spark-1266/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs
>> /_site/index.html
>>
>>
>> FAQ
>>
>> ===
>> What are the unresolved issues targeted for 2.3.0?
>> ===
>>
>> Please see https://s.apache.org/oXKi. At the time of writing,
>> there are currently no known release blockers.
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by
>> taking an existing Spark workload and running on this release 
>> candidate,
>> then reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and
>> install the current RC and see if anything important breaks, in the
>> Java/Scala you can add the staging repository to your projects 
>> resolvers
>> and test with the RC (make sure to clean up the artifact cache 
>> before/after
>> so you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.0?
>> ===
>>
>> Committers should look at those and triage. Extremely important
>> bug fixes, documentation, and API tweaks that impact compatibility 
>> should
>> be worked on immediately. Everything else please retarget to 2.3.1 
>> or 2.4.0
>> as appropriate.

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-18 Thread Dongjoon Hyun
+1 (non-binding).

Bests,
Dongjoon.

On Wed, Jul 18, 2018 at 11:32 AM Henry Robinson 
wrote:

> +1 (non-binding)
> On Wed, Jul 18, 2018 at 9:12 AM Reynold Xin  wrote:
>
>> +1 on this, on the condition that we can come up with a design that will
>> remove the existing plans.
>>
>>
>> On Tue, Jul 17, 2018 at 11:00 AM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> From discussion on the proposal doc and the discussion thread, I think
>>> we have consensus around the plan to standardize logical write operations
>>> for DataSourceV2. I would like to call a vote on the proposal.
>>>
>>> The proposal doc is here: SPIP: Standardize SQL logical plans
>>> 
>>> .
>>>
>>> This vote is for the plan in that doc. The related SPIP with APIs to
>>> create/alter/drop tables will be a separate vote.
>>>
>>> Please vote in the next 72 hours:
>>>
>>> [+1]: Spark should adopt the SPIP
>>> [-1]: Spark should not adopt the SPIP because . . .
>>>
>>> Thanks for voting, everyone!
>>>
>>> --
>>> Ryan Blue
>>>
>>


Re: Branch 2.4 is cut

2018-09-07 Thread Dongjoon Hyun
Thank you, Shane! :D

Bests,
Dongjoon.

On Fri, Sep 7, 2018 at 9:51 AM shane knapp  wrote:

> i'll try and get to the 2.4 branch stuff today...
>
>


Re: Branch 2.4 is cut

2018-09-06 Thread Dongjoon Hyun
Great news on the branch cut and the Scala 2.12 build.

We also need to add `branch-2.4` to our Jenkins dashboard to prevent any
regression.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

Bests,
Dongjoon.


On Thu, Sep 6, 2018 at 6:56 AM Wenchen Fan  wrote:

> Good news! I'll try and update you later. Thanks!
>
> On Thu, Sep 6, 2018 at 9:44 PM Sean Owen  wrote:
>
>> BTW it does appear the Scala 2.12 build works now:
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/229/
>>
>> Let's try also producing a 2.12 build with this release. The machinery
>> should be there in the release scripts, but let me know if something fails
>> while running the release for 2.12.
>>
>> On Thu, Sep 6, 2018 at 12:32 AM Wenchen Fan  wrote:
>>
>>> Hi all,
>>>
>>> I've cut the branch-2.4 since all the major blockers are resolved. If no
>>> objections I'll shortly followup with an RC to get the QA started in
>>> parallel.
>>>
>>> Committers, please only merge PRs to branch-2.4 that are bug fixes,
>>> performance regression fixes, document changes, or test suites changes.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>


Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Dongjoon Hyun
Great.

If we can upgrade the Parquet dependency from 1.8.2 to 1.8.3 in Apache
Spark 2.3.1, let's upgrade the ORC dependency from 1.4.1 to 1.4.3 together.

Currently, the patch is only merged into the master branch. 1.4.1 has the
following issue.

https://issues.apache.org/jira/browse/SPARK-23340

Bests,
Dongjoon.



On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin  wrote:

> Seems like this would make sense... we usually make maintenance releases
> for bug fixes after a month anyway.
>
>
> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson  wrote:
>
>>
>>
>> On 11 April 2018 at 12:47, Ryan Blue  wrote:
>>
>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>> Spark.
>>>
>>> To be clear though, this only affects Spark when reading data written by
>>> Impala, right? Or does Parquet CPP also produce data like this?
>>>
>>
>> I don't know about parquet-cpp, but yeah, the only implementation I've
>> seen writing the half-completed stats is Impala. (as you know, that's
>> compliant with the spec, just an unusual choice).
>>
>>
>>>
>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson 
>>> wrote:
>>>
 Hi all -

 SPARK-23852 (where a query can silently give wrong results thanks to a
 predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
 I've been involved with, we've released maintenance releases for bugs of
 this severity.

 Since Spark 2.4.0 is probably a while away, I wanted to see if there
 was any consensus over whether we should consider (at least) a 2.3.1.

 The reason this particular issue is a bit tricky is that the Parquet
 community haven't yet produced a maintenance release that fixes the
 underlying bug, but they are in the process of releasing a new minor
 version, 1.10, which includes a fix. Having spoken to a couple of Parquet
 developers, they'd be willing to consider a maintenance release, but would
 probably only bother if we (or another affected project) asked them to.

 My guess is that we wouldn't want to upgrade to a new minor version of
 Parquet for a Spark maintenance release, so asking for a Parquet
 maintenance release makes sense.

 What does everyone think?

 Best,
 Henry

>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
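
Until a Spark maintenance release picks up the patched versions discussed in
this thread, a downstream build that wants to try them can, in principle,
override Spark's transitive dependencies. The following is a minimal sketch
under stated assumptions only: an sbt 1.x build, and illustrative artifact
coordinates (parquet-hadoop and orc-core; a real override would need to cover
every Parquet/ORC module Spark pulls in):

    // build.sbt -- sketch only; whether an override like this is safe for a
    // given Spark version is not something this thread guarantees.
    dependencyOverrides ++= Seq(
      // Parquet maintenance release discussed above for the pushdown bug.
      "org.apache.parquet" % "parquet-hadoop" % "1.8.3",
      // ORC maintenance release discussed above (see SPARK-23340).
      "org.apache.orc"     % "orc-core"       % "1.4.3"
    )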


Re: Maintenance releases for SPARK-23852?

2018-04-17 Thread Dongjoon Hyun
Since it's a backport from master to branch-2.3 for ORC 1.4.3, I made a
backport PR.

https://github.com/apache/spark/pull/21093

Thank you for raising this issue and confirming, Henry and Xiao. :)

Bests,
Dongjoon.


On Tue, Apr 17, 2018 at 12:01 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> Yes, it sounds good to me. We can upgrade both Parquet 1.8.2 to 1.8.3 and
> ORC 1.4.1 to 1.4.3 in our upcoming Spark 2.3.1 release.
>
> Thanks for your efforts! @Henry and @Dongjoon
>
> Xiao
>
> 2018-04-16 14:41 GMT-07:00 Henry Robinson <he...@apache.org>:
>
>> Seems like there aren't any objections. I'll pick this thread back up
>> when a Parquet maintenance release has happened.
>>
>> Henry
>>
>> On 11 April 2018 at 14:00, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>
>>> Great.
>>>
>>> If we can upgrade the Parquet dependency from 1.8.2 to 1.8.3 in Apache
>>> Spark 2.3.1, let's upgrade the ORC dependency from 1.4.1 to 1.4.3 together.
>>>
>>> Currently, the patch is only merged into the master branch. 1.4.1 has
>>> the following issue.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-23340
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Seems like this would make sense... we usually make maintenance
>>>> releases for bug fixes after a month anyway.
>>>>
>>>>
>>>> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <he...@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 11 April 2018 at 12:47, Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>>>>> Spark.
>>>>>>
>>>>>> To be clear though, this only affects Spark when reading data written
>>>>>> by Impala, right? Or does Parquet CPP also produce data like this?
>>>>>>
>>>>>
>>>>> I don't know about parquet-cpp, but yeah, the only implementation I've
>>>>> seen writing the half-completed stats is Impala. (as you know, that's
>>>>> compliant with the spec, just an unusual choice).
>>>>>
>>>>>
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all -
>>>>>>>
>>>>>>> SPARK-23852 (where a query can silently give wrong results thanks to
>>>>>>> a predicate pushdown bug in Parquet) is a fairly bad bug. In other 
>>>>>>> projects
>>>>>>> I've been involved with, we've released maintenance releases for bugs of
>>>>>>> this severity.
>>>>>>>
>>>>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>>>>
>>>>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>>>>> community haven't yet produced a maintenance release that fixes the
>>>>>>> underlying bug, but they are in the process of releasing a new minor
>>>>>>> version, 1.10, which includes a fix. Having spoken to a couple of 
>>>>>>> Parquet
>>>>>>> developers, they'd be willing to consider a maintenance release, but 
>>>>>>> would
>>>>>>> probably only bother if we (or another affected project) asked them to.
>>>>>>>
>>>>>>> My guess is that we wouldn't want to upgrade to a new minor version
>>>>>>> of Parquet for a Spark maintenance release, so asking for a Parquet
>>>>>>> maintenance release makes sense.
>>>>>>>
>>>>>>> What does everyone think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Henry
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Dongjoon Hyun
Congratulations!

Bests,
Dongjoon.

On Mon, Apr 2, 2018 at 07:57 Cody Koeninger  wrote:

> Congrats!
>
> On Mon, Apr 2, 2018 at 12:28 AM, Wenchen Fan  wrote:
> > Hi all,
> >
> > The Spark PMC recently added Zhenhua Wang as a committer on the project.
> > Zhenhua is the major contributor of the CBO project, and has been
> > contributing across several areas of Spark for a while, focusing
> especially
> > on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua!
> >
> > Wenchen
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.2.2 (RC2)

2018-06-28 Thread Dongjoon Hyun
+1

Tested on CentOS 7.4 and Oracle JDK 1.8.0_171.

Bests,
Dongjoon.

On Thu, Jun 28, 2018 at 7:24 AM Takeshi Yamamuro 
wrote:

> +1
>
> I run tests on a EC2 m4.2xlarge instance;
> [ec2-user]$ java -version
> openjdk version "1.8.0_171"
> OpenJDK Runtime Environment (build 1.8.0_171-b10)
> OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)
>
>
>
>
> On Thu, Jun 28, 2018 at 11:38 AM Wenchen Fan  wrote:
>
>> +1
>>
>> On Thu, Jun 28, 2018 at 10:19 AM zhenya Sun  wrote:
>>
>>> +1
>>>
>>> On Jun 28, 2018, at 10:15 AM, Hyukjin Kwon wrote:
>>>
>>> +1
>>>
>>> On Thu, Jun 28, 2018 at 8:42 AM, Sean Owen wrote:
>>>
 +1 from me too.

 On Wed, Jun 27, 2018 at 3:31 PM Tom Graves <
 tgraves...@yahoo.com.invalid> wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.2.2.
>
> The vote is open until Mon, July 2nd @ 9PM UTC (2PM PDT) and passes if
> a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.2.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.2.2-rc2 (commit
> fc28ba3db7185e84b6dbd02ad8ef8f1d06b9e3c6):
> https://github.com/apache/spark/tree/v2.2.2-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.2.2-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1276/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.2.2-rc2-docs/
>
> The list of bug fixes going into 2.2.2 can be found at the following
> URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342171
>
>
> Notes:
>
> - RC1 was not sent for a vote. I had trouble building it, and by the
> time I got
>   things fixed, there was a blocker bug filed. It was already tagged
> in git
>   at that time.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.2.2?
> ===
>
> The current list of open tickets targeted at 2.2.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.2.2
>
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Tom Graves
>

>>>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Random sampling in tests

2018-10-08 Thread Dongjoon Hyun
Sean's approach looks much better to me (
https://github.com/apache/spark/pull/22672)

It achieves both contradictory goals simultaneously: keeping all the test
coverage and reducing the time from 2:31 to 0:24.

Since we can remove test coverage anytime, can we proceed with Sean's
non-intrusive approach first, before removing anything?

Bests,
Dongjoon.


On Mon, Oct 8, 2018 at 8:57 AM Xiao Li  wrote:

> Yes. Testing all the timezones is not needed.
>
> Xiao
>
> On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk 
> wrote:
>
>> Hi All,
>>
>> I believe we should also take into account what we test. For example, I
>> don't think it makes sense to check all timezones for JSON/CSV
>> functions/datasources, because those timezones are just passed to external
>> libraries. So the same code is involved in testing each of the 650
>> timezones, and we basically just spend time and resources on testing the
>> external libraries.
>>
>> I mean the PRs: https://github.com/apache/spark/pull/22657 and
>> https://github.com/apache/spark/pull/22379#discussion_r223039662
>>
>> Maxim Gekk
>>
>> Technical Solutions Lead
>>
>> Databricks B. V.  
>>
>>
>> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen  wrote:
>>
>>> If the problem is simply reducing the wall-clock time of tests, then
>>> even before we get to this question, I'm advocating:
>>>
>>> 1) try simple parallelization of tests within the suite. In this
>>> instance there's no reason not to test these in parallel and get a 8x
>>> or 16x speedup from cores. This assumes, I believe correctly, that the
>>> machines aren't generally near 100% utilization
>>> 2) explicitly choose a smaller, more directed set of cases to test
>>>
>>> Randomly choosing test cases with a fixed seed is basically 2, but not
>>> choosing test cases for a particular reason. You can vary the seed but
>>> as a rule the same random subset of tests is always chosen. Could be
>>> fine if there's no reason at all to prefer some cases over others. But
>>> I am guessing any wild guess at the most important subset of cases to
>>> test is better than random.
>>>
>>> I'm trying 1) right now instead in these several cases.
>>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li  wrote:
>>> >
>>> > For this specific case, I do not think we should test all the
>>> timezones. If this were fast, I would be fine leaving it unchanged. However,
>>> this is very slow, so I would even prefer reducing the tested timezones to a
>>> smaller number or just hardcoding some specific time zones.
>>> >
>>> > In general, I like Reynold's idea of including the seed value and adding
>>> the seed to the test case name. This can help us reproduce it.
>>> >
>>> > Xiao
>>> >
>>> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin 
>>> wrote:
>>> >>
>>> >> I'm personally not a big fan of doing it that way in the PR. It is
>>> perfectly fine to employ randomized tests, and in this case it might even
>>> be fine to just pick couple different timezones like the way it happened in
>>> the PR, but we should:
>>> >>
>>> >> 1. Document in the code comment why we did it that way.
>>> >>
>>> >> 2. Use a seed and log the seed, so any test failures can be
>>> reproduced deterministically. For this one, it'd be better to pick the seed
>>> from a seed environment variable. If the env variable is not set, set it to
>>> a random seed.
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>> >>>
>>> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>> >>> that tests a bunch of cases by randomly choosing different subsets of
>>> >>> cases to test on each Jenkins run.
>>> >>>
>>> >>> There's disagreement about whether this is good approach to improving
>>> >>> test runtime. Here's a discussion on one that was committed:
>>> >>> https://github.com/apache/spark/pull/22631/files#r223190476
>>> >>>
>>> >>> I'm flagging it for more input.
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
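
The following is a minimal sketch of the seeded approach suggested in this
thread: the seed is read from an environment variable when present (falling
back to a random value) and is included in the test name, so a failing run
reports the seed and can be reproduced deterministically. The suite name,
environment-variable name, sample size, and the body of the check are
illustrative assumptions, not actual Spark test code; Sean's alternative of
keeping the full set and parallelizing it is in
https://github.com/apache/spark/pull/22672.

    import java.util.TimeZone
    import scala.util.Random
    import org.scalatest.FunSuite

    class TimestampFormattingSuite extends FunSuite {

      // Pick the seed from an environment variable when set, otherwise
      // randomly, so a failure can be reproduced by exporting the same seed.
      private val seed: Long =
        sys.env.get("SPARK_TEST_RANDOM_SEED").map(_.toLong)
          .getOrElse(new Random().nextLong())

      // Sample a small subset of the ~650 available timezones with that seed.
      private val sampledTimeZones: Seq[String] =
        new Random(seed).shuffle(TimeZone.getAvailableIDs.toSeq).take(20)

      test(s"timestamp handling across sampled timezones (seed=$seed)") {
        for (tz <- sampledTimeZones) {
          // Placeholder check; a real test would exercise the formatting
          // code under each sampled timezone here.
          assert(TimeZone.getTimeZone(tz).getID === tz)
        }
      }
    }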


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-12 Thread Dongjoon Hyun
Hi, Holden.

Since that's a performance regression in 2.4.0, I marked it as `Blocker` four days ago.

Bests,
Dongjoon.


On Fri, Oct 12, 2018 at 11:45 AM Holden Karau  wrote:

> Following up I just wanted to make sure this new blocker that Dongjoon
> designated is surfaced -
> https://jira.apache.org/jira/browse/SPARK-25579?filter=12340409=affectedVersion%20%3D%202.4.0%20AND%20cf%5B12310320%5D%20is%20EMPTY%20AND%20project%20%3D%20spark%20AND%20(status%20%3D%20%22In%20Progress%22%20OR%20resolution%20%3D%20Unresolved)%20AND%20priority%20%3D%20Blocker
>
> On Thu, Oct 11, 2018 at 2:05 PM Xiao Li  wrote:
>
>> -1. We have two correctness bugs:
>> https://issues.apache.org/jira/browse/SPARK-25714 and
>> https://issues.apache.org/jira/browse/SPARK-25708.
>>
>> Let us fix all the three issues in ScalaUDF, as mentioned by Sean.
>>
>> Xiao
>>
>>
>> Sean Owen wrote on Thu, Oct 11, 2018 at 9:04 AM:
>>
>>> This is a legitimate question about the behavior of ScalaUDF after the
>>> change to support 2.12:
>>> https://github.com/apache/spark/pull/22259#discussion_r224295469
>>> Not quite a blocker I think, but a potential gotcha we definitely need
>>> to highlight in release notes. There may be an argument for changing
>>> ScalaUDF again before the release. Have a look, anyone familiar with
>>> catalyst.
>>> On Wed, Oct 10, 2018 at 3:00 PM Sean Owen  wrote:
>>> >
>>> > +1. I tested the source build against Scala 2.12 and common build
>>> > profiles. License and sigs look OK.
>>> >
>>> > No blockers; one critical:
>>> >
>>> > SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
>>> >
>>> > I think this one is "won't fix" though? not trying to restore the
>>> behavior?
>>> >
>>> > Other items open for 2.4.0:
>>> >
>>> > SPARK-25347 Document image data source in doc site
>>> > SPARK-25584 Document libsvm data source in doc site
>>> > SPARK-25179 Document the features that require Pyarrow 0.10
>>> > SPARK-25507 Update documents for the new features in 2.4 release
>>> > SPARK-25346 Document Spark builtin data sources
>>> > SPARK-24464 Unit tests for MLlib's Instrumentation
>>> > SPARK-23197 Flaky test:
>>> spark.streaming.ReceiverSuite."receiver_life_cycle"
>>> > SPARK-22809 pyspark is sensitive to imports with dots
>>> > SPARK-21030 extend hint syntax to support any expression for Python
>>> and R
>>> >
>>> > Anyone know enough to close or retarget them? they don't look critical
>>> > for 2.4, SPARK-25507 has no content, itself. SPARK-25179 "Document the
>>> > features that require Pyarrow 0.10" however sounds like it could have
>>> > been important for 2.4? if not a blocker.
>>> >
>>> > PS I don't think that SPARK-25150 is an issue; see JIRA. At least
>>> > there is some ongoing discussion there.
>>> >
>>> > I am evaluating
>>> > https://github.com/apache/spark/pull/22259#discussion_r224252642 right
>>> > now.
>>> >
>>> >
>>> > On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan 
>>> wrote:
>>> > >
>>> > > Please vote on releasing the following candidate as Apache Spark
>>> version 2.4.0.
>>> > >
>>> > > The vote is open until October 1 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> > > a minimum of 3 +1 votes.
>>> > >
>>> > > [ ] +1 Release this package as Apache Spark 2.4.0
>>> > > [ ] -1 Do not release this package because ...
>>> > >
>>> > > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> > >
>>> > > The tag to be voted on is v2.4.0-rc3 (commit
>>> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
>>> > > https://github.com/apache/spark/tree/v2.4.0-rc3
>>> > >
>>> > > The release files, including signatures, digests, etc. can be found
>>> at:
>>> > > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>>> > >
>>> > > Signatures used for Spark RCs can be found in this file:
>>> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> > >
>>> > > The staging repository for this release can be found at:
>>> > >
>>> https://repository.apache.org/content/repositories/orgapachespark-1289
>>> > >
>>> > > The documentation corresponding to this release can be found at:
>>> > > https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>>> > >
>>> > > The list of bug fixes going into 2.4.0 can be found at the following
>>> URL:
>>> > > https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>> > >
>>> > > FAQ
>>> > >
>>> > > =
>>> > > How can I help test this release?
>>> > > =
>>> > >
>>> > > If you are a Spark user, you can help us test this release by taking
>>> > > an existing Spark workload and running on this release candidate,
>>> then
>>> > > reporting any regressions.
>>> > >
>>> > > If you're working in PySpark, you can set up a virtual env, install
>>> > > the current RC, and see if anything important breaks; in Java/Scala,
>>> > > you can add the staging repository to your project's resolvers and
>>> > > test with the RC (make sure to clean up the artifact cache before/after
>>> > > so you don't end up building with an out-of-date RC going forward).

Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Dongjoon Hyun
I also agree with Reynold and Xiao.

Although I love that new feature, the Spark 2.4 branch cut was made a long
time ago.

We cannot backport new features at this stage (RC4).

In addition, could you split the Apache Spark issue IDs, Ilan? It's confusing
during the discussion.

 (1) [SPARK-23257][K8S] Kerberos Support for Spark on K8S (merged
yesterday for Apache Spark 3.0)
 (2) [SPARK-23257][K8S][TESTS] Kerberos Support Integration Tests (a
live PR with about *2000 lines. It's not a follow-up size.*)

For (1), it was merged yesterday. That means more people will start to try (1)
from today, and we need more time to stabilize it.
For (2), it's still under review.

Both (1) and (2) look valid only for Spark 3.0.0.

Bests,
Dongjoon.



On Tue, Oct 16, 2018 at 1:32 PM Xiao Li  wrote:

> We need to strictly follow the backport and release policy. We can't merge
> such a new feature into a RC branch or a minor release (e.g., 2.4.1).
>
> Cheers,
>
> Xiao
>
> Bolke de Bruin wrote on Tue, Oct 16, 2018 at 12:48 PM:
>
>> Chiming in here. We are in the same boat as Bloomberg.
>>
>> (But being a release manager often myself I understand the trade-off)
>>
>> B.
>>
>> On Tue, Oct 16, 2018 at 21:24, Ilan Filonenko wrote:
>>
>>> On Erik's note, would SPARK-23257 be included in, say, 2.4.1? When would
>>> the next RC be? I would like to propose the inclusion of the Kerberos
>>> feature sooner rather than later as it would increase Spark-on-K8S adoption
>>> in production workloads while bringing greater feature parity with Yarn and
>>> Mesos. I would like to note that the feature itself is isolated from Core
>>> and isolated via the step-based architecture of the Kubernetes
>>> Driver/Executor builders.
>>>
>>> Furthermore, Spark users traditionally use HDFS for storage and in
>>> production use-cases these HDFS clusters would be kerberized. At Bloomberg,
>>> for example, all of the HDFS clusters are kerberized and for this reason,
>>> the only thing stopping our internal Data Science Platform from adopting
>>> Spark-on-K8S is this feature.
>>>
>>> On Tue, Oct 16, 2018 at 10:21 AM Erik Erlandson 
>>> wrote:
>>>

 SPARK-23257 merged more recently than I realized. If that isn't on
 branch-2.4 then the first question is how soon on the release sequence that
 can be adopted

 On Tue, Oct 16, 2018 at 9:33 AM Reynold Xin 
 wrote:

> We shouldn’t merge new features into release branches anymore.
>
> On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse 
> wrote:
>
>> Right now the Kerberos support for Spark on K8S is only on master
>> AFAICT i.e. the feature is not present on branch-2.4
>>
>>
>>
>> Therefore I don’t see any point in adding the tests into branch-2.4
>> unless the plan is to also merge the Kerberos support to branch-2.4
>>
>>
>>
>> Rob
>>
>>
>>
>> *From: *Erik Erlandson 
>> *Date: *Tuesday, 16 October 2018 at 16:47
>> *To: *dev 
>> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests
>> for Spark 2.4
>>
>>
>>
>> I'd like to propose including integration testing for Kerberos on the
>> Spark 2.4 release:
>>
>> https://github.com/apache/spark/pull/22608
>>
>>
>>
>> Arguments in favor:
>>
>> 1) it improves testing coverage on a feature important for
>> integrating with HDFS deployments
>>
>> 2) its intersection with existing code is small - it consists
>> primarily of new testing code, with a bit of refactoring into 'main' and
>> 'test' sub-trees. These new tests appear stable.
>>
>> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>>
>>
>>
>> The argument 'against' that I'm aware of would be the relatively
>> large size of the PR. I believe this is considered above, but am 
>> soliciting
>> community feedback before committing.
>>
>> Cheers,
>>
>> Erik
>>
>>
>>
>


GitHub is out of order

2018-10-21 Thread Dongjoon Hyun
Hi, All.

GitHub is currently experiencing an outage, and the Apache Spark repo is
affected as well. Newly filed pull requests against the Apache Spark
repository seem to disappear repeatedly, too.

https://status.github.com/messages

Bests,
Dongjoon.


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Dongjoon Hyun
From the document, should we be more specific and say 'Java 8' instead of
'Java 8+', since we don't build (or test) with Java 9 ~ 11 in the community?

https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/_site/building-spark.html
> Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+

Bests,
Dongjoon.


On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until October 1 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc3 (commit
> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
> https://github.com/apache/spark/tree/v2.4.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1289
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-19 Thread Dongjoon Hyun
Thank you, Wenchen. I opened a minor, documentation-only PR.

https://github.com/apache/spark/pull/22781

Bests,
Dongjoon.



On Fri, Oct 19, 2018 at 6:07 PM Wenchen Fan  wrote:

> AFAIK we haven't tested Java 9+ yet, so I'm ok to change it.
>
> Hi Dongjoon can you make a PR for it? We can merge it very soon if we
> decide to do it.
>
> Thanks,
> Wenchen
>
> On Sat, Oct 20, 2018 at 5:27 AM Dongjoon Hyun 
> wrote:
>
>> From the document, should we be more specific and say 'Java 8' instead of
>> 'Java 8+', since we don't build (or test) with Java 9 ~ 11 in the community?
>>
>>
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/_site/building-spark.html
>> > Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.4.0.
>>>
>>> The vote is open until October 1 PST and passes if a majority +1 PMC
>>> votes are cast, with
>>> a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.4.0-rc3 (commit
>>> 8e4a99bd201b9204fec52580f19ae70a229ed94e):
>>> https://github.com/apache/spark/tree/v2.4.0-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1289
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc3-docs/
>>>
>>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env, install
>>> the current RC, and see if anything important breaks; in Java/Scala,
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.4.0?
>>> ===
>>>
>>> The current list of open tickets targeted at 2.4.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.4.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>


Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-13 Thread Dongjoon Hyun
Yes. From my side, it's -1 for RC3.

Bests,
Dongjoon.

On Sat, Oct 13, 2018 at 1:24 PM Holden Karau  wrote:

> So if it's a blocker would you think this should be a -1?
>
> On Fri, Oct 12, 2018 at 3:52 PM Dongjoon Hyun 
> wrote:
>
>> Hi, Holden.
>>
>> Since that's a performance regression in 2.4.0, I marked it as `Blocker` four days ago.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Oct 12, 2018 at 11:45 AM Holden Karau 
>> wrote:
>>
>>> Following up I just wanted to make sure this new blocker that Dongjoon
>>> designated is surfaced -
>>> https://jira.apache.org/jira/browse/SPARK-25579?filter=12340409=affectedVersion%20%3D%202.4.0%20AND%20cf%5B12310320%5D%20is%20EMPTY%20AND%20project%20%3D%20spark%20AND%20(status%20%3D%20%22In%20Progress%22%20OR%20resolution%20%3D%20Unresolved)%20AND%20priority%20%3D%20Blocker
>>>
>>> On Thu, Oct 11, 2018 at 2:05 PM Xiao Li  wrote:
>>>
>>>> -1. We have two correctness bugs:
>>>> https://issues.apache.org/jira/browse/SPARK-25714 and
>>>> https://issues.apache.org/jira/browse/SPARK-25708.
>>>>
>>>> Let us fix all the three issues in ScalaUDF, as mentioned by Sean.
>>>>
>>>> Xiao
>>>>
>>>>
>>>> Sean Owen wrote on Thu, Oct 11, 2018 at 9:04 AM:
>>>>
>>>>> This is a legitimate question about the behavior of ScalaUDF after the
>>>>> change to support 2.12:
>>>>> https://github.com/apache/spark/pull/22259#discussion_r224295469
>>>>> Not quite a blocker I think, but a potential gotcha we definitely need
>>>>> to highlight in release notes. There may be an argument for changing
>>>>> ScalaUDF again before the release. Have a look, anyone familiar with
>>>>> catalyst.
>>>>> On Wed, Oct 10, 2018 at 3:00 PM Sean Owen  wrote:
>>>>> >
>>>>> > +1. I tested the source build against Scala 2.12 and common build
>>>>> > profiles. License and sigs look OK.
>>>>> >
>>>>> > No blockers; one critical:
>>>>> >
>>>>> > SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
>>>>> >
>>>>> > I think this one is "won't fix" though? not trying to restore the
>>>>> behavior?
>>>>> >
>>>>> > Other items open for 2.4.0:
>>>>> >
>>>>> > SPARK-25347 Document image data source in doc site
>>>>> > SPARK-25584 Document libsvm data source in doc site
>>>>> > SPARK-25179 Document the features that require Pyarrow 0.10
>>>>> > SPARK-25507 Update documents for the new features in 2.4 release
>>>>> > SPARK-25346 Document Spark builtin data sources
>>>>> > SPARK-24464 Unit tests for MLlib's Instrumentation
>>>>> > SPARK-23197 Flaky test:
>>>>> spark.streaming.ReceiverSuite."receiver_life_cycle"
>>>>> > SPARK-22809 pyspark is sensitive to imports with dots
>>>>> > SPARK-21030 extend hint syntax to support any expression for Python
>>>>> and R
>>>>> >
>>>>> > Anyone know enough to close or retarget them? they don't look
>>>>> critical
>>>>> > for 2.4, SPARK-25507 has no content, itself. SPARK-25179 "Document
>>>>> the
>>>>> > features that require Pyarrow 0.10" however sounds like it could have
>>>>> > been important for 2.4? if not a blocker.
>>>>> >
>>>>> > PS I don't think that SPARK-25150 is an issue; see JIRA. At least
>>>>> > there is some ongoing discussion there.
>>>>> >
>>>>> > I am evaluating
>>>>> > https://github.com/apache/spark/pull/22259#discussion_r224252642
>>>>> right
>>>>> > now.
>>>>> >
>>>>> >
>>>>> > On Wed, Oct 10, 2018 at 9:47 AM Wenchen Fan 
>>>>> wrote:
>>>>> > >
>>>>> > > Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.4.0.
>>>>> > >
>>>>> > > The vote is open until October 1 PST and passes if a majority +1
>>>>> PMC votes are cast, with
>>>>> > > a minimum of 3 +1 votes.
>>>>> > >
>>>>> > > [ ] +1 Release this package as Apache Spark 2.4.0
>>

Re: [VOTE] SPARK 2.4.0 (RC4)

2018-10-23 Thread Dongjoon Hyun
BTW, for that integration suite, I saw the related artifacts in the RC4
staging directory.

Does Spark 2.4.0 need to start releasing these
`spark-kubernetes-integration-tests` artifacts?

   - https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.11/
   - https://repository.apache.org/content/repositories/orgapachespark-1290/org/apache/spark/spark-kubernetes-integration-tests_2.12/

Historically, Spark released `spark-docker-integration-tests` during the Spark
1.6.x era and stopped as of Spark 2.0.0.

   - http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.10/
   - http://central.maven.org/maven2/org/apache/spark/spark-docker-integration-tests_2.11/


Bests,
Dongjoon.

On Tue, Oct 23, 2018 at 11:43 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Sean,
>
> OK, makes sense. I'm using a cloned repo. I built with the Scala 2.12 profile
> using the related tag v2.4.0-rc4:
>
> ./dev/change-scala-version.sh 2.12
> ./dev/make-distribution.sh  --name test --r --tgz -Pscala-2.12 -Psparkr
> -Phadoop-2.7 -Pkubernetes -Phive
> Pushed images to Docker Hub (previous email) since I didn't use the minikube
> daemon (default behavior).
>
> Then run tests successfully against minikube:
>
> TGZ_PATH=$(pwd)/spark-2.4.0-bin-test.gz
> cd resource-managers/kubernetes/integration-tests
>
> ./dev/dev-run-integration-tests.sh --spark-tgz $TGZ_PATH --service-account
> default --namespace default --image-tag k8s-scala-12 --image-repo skonto
>
>
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) @
> spark-kubernetes-integration-tests_2.12 ---
> Discovery starting.
> Discovery completed in 229 milliseconds.
> Run starting. Expected test count is: 14
> KubernetesSuite:
> - Run SparkPi with no resources
> - Run SparkPi with a very long application name.
> - Use SparkLauncher.NO_RESOURCE
> - Run SparkPi with a master URL without a scheme.
> - Run SparkPi with an argument.
> - Run SparkPi with custom labels, annotations, and environment variables.
> - Run extraJVMOptions check on driver
> - Run SparkRemoteFileTest using a remote data file
> - Run SparkPi with env and mount secrets.
> - Run PySpark on simple pi.py example
> - Run PySpark with Python2 to test a pyfiles example
> - Run PySpark with Python3 to test a pyfiles example
> - Run PySpark with memory customization
> - Run in client mode.
> Run completed in 5 minutes, 24 seconds.
> Total number of tests run: 14
> Suites: completed 2, aborted 0
> Tests: succeeded 14, failed 0, canceled 0, ignored 0, pending 0
> All tests passed.
> [INFO]
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM 2.4.0 . SUCCESS [
> 4.491 s]
> [INFO] Spark Project Tags . SUCCESS [
> 3.833 s]
> [INFO] Spark Project Local DB . SUCCESS [
> 2.680 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 4.817 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 2.541 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 2.795 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 5.593 s]
> [INFO] Spark Project Core . SUCCESS [
> 25.160 s]
> [INFO] Spark Project Kubernetes Integration Tests 2.4.0 ... SUCCESS [05:30
> min]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 06:23 min
> [INFO] Finished at: 2018-10-23T18:39:11Z
> [INFO]
> 
>
>
> but I had to modify this line and add -Pscala-2.12 to it, otherwise it fails
> (these tests inherit from the parent pom, but the profile is not propagated
> to the mvn command that launches the tests; I can create a PR to fix that).
>
>
> On Tue, Oct 23, 2018 at 7:44 PM, Hyukjin Kwon  wrote:
>
>> https://github.com/apache/spark/pull/22514 sounds like a regression that
>> affects Hive CTAS in the write path (by not replacing them with Spark's
>> internal data sources, and therefore a performance regression),
>> but yeah, I am not sure whether we should block the release on this.
>>
>> https://github.com/apache/spark/pull/22144 is just being discussed if I
>> am not mistaken.
>>
>> Thanks.
>>
>>> On Wed, Oct 24, 2018 at 12:27 AM, Xiao Li wrote:
>>
>>> https://github.com/apache/spark/pull/22144 is also not a blocker of
>>> Spark 2.4 release, as discussed in the PR.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> Xiao Li wrote on Tue, Oct 23, 2018 at 9:20 AM:
>>>
 Thanks for reporting this. https://github.com/apache/spark/pull/22514
 is 
