Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021

2021-12-06 Thread Nick Pentreath
Wow! End of an era.

Thanks so much, Shane, for all your work over 10 (!!) years. And to
AMPLab also!

Farewell Spark Jenkins!

N

On Tue, Dec 7, 2021 at 6:49 AM Nicholas Chammas 
wrote:

> Farewell to Jenkins and its classic weather forecast build status icons:
>
> [image: health-80plus.png][image: health-60to79.png][image:
> health-40to59.png][image: health-20to39.png][image: health-00to19.png]
>
> And thank you Shane for all the help over these years.
>
> Will you be nuking all the Jenkins-related code in the repo after the 23rd?
>
> On Mon, Dec 6, 2021 at 3:02 PM shane knapp ☠  wrote:
>
>> hey everyone!
>>
>> after a marathon run of nearly a decade, we're finally going to be
>> shutting down {amp|rise}lab jenkins at the end of this month...
>>
>> the earliest snapshot i could find is from 2013 with builds for spark 0.7:
>>
>> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>>
>> it's been a hell of a run, and i'm gonna miss randomly tweaking the build
>> system, but technology has moved on and running a dedicated set of servers
>> for just one open source project is just too expensive for us here at uc
>> berkeley.
>>
>> if there's interest, i'll fire up a zoom session and all y'alls can watch
>> me type the final command:
>>
>> systemctl stop jenkins
>>
>> feeling bittersweet,
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: Welcoming six new Apache Spark committers

2021-03-29 Thread Nick Pentreath
Congratulations to all the new committers. Welcome!


On Fri, 26 Mar 2021 at 22:22, Matei Zaharia  wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new role! Our new committers are:
>
> - Maciej Szymkiewicz (contributor to PySpark)
> - Max Gekk (contributor to Spark SQL)
> - Kent Yao (contributor to Spark SQL)
> - Attila Zsolt Piros (contributor to decommissioning and Spark on
> Kubernetes)
> - Yi Wu (contributor to Spark Core and SQL)
> - Gabor Somogyi (contributor to Streaming and security)
>
> All six of them contributed to Spark 3.1 and we’re very excited to have
> them join as committers.
>
> Matei and the Spark PMC
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Nick Pentreath
Congratulations and welcome as Apache Spark committers!

On Wed, 15 Jul 2020 at 06:59, Prashant Sharma  wrote:

> Congratulations all ! It's great to have such committed folks as
> committers. :)
>
> On Wed, Jul 15, 2020 at 9:24 AM Yi Wu  wrote:
>
>> Congrats!!
>>
>> On Wed, Jul 15, 2020 at 8:02 AM Hyukjin Kwon  wrote:
>>
>>> Congrats!
>>>
>>> On Wed, Jul 15, 2020 at 7:56 AM, Takeshi Yamamuro wrote:
>>>
 Congrats, all!

 On Wed, Jul 15, 2020 at 5:15 AM Takuya UESHIN 
 wrote:

> Congrats and welcome!
>
> On Tue, Jul 14, 2020 at 1:07 PM Bryan Cutler 
> wrote:
>
>> Congratulations and welcome!
>>
>> On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang 
>> wrote:
>>
>>> Welcome, Huaxin, Jungtaek, and Dilip!
>>>
>>> Congratulations!
>>>
>>> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>>
 Hi all,

 The Spark PMC recently voted to add several new committers. Please
 join me in welcoming them to their new roles! The new committers are:

 - Huaxin Gao
 - Jungtaek Lim
 - Dilip Biswal

 All three of them contributed to Spark 3.0 and we’re excited to
 have them join the project.

 Matei and the Spark PMC

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>
> --
> Takuya UESHIN
>
>

 --
 ---
 Takeshi Yamamuro

>>>


Re: Revisiting Online serving of Spark models?

2018-06-05 Thread Nick Pentreath
I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.

On Sun, 3 Jun 2018 at 00:24 Holden Karau  wrote:

> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
> maximilianofel...@gmail.com> wrote:
>
>> Hi!
>>
>> We're already in San Francisco waiting for the summit. We even think that
>> we spotted @holdenk this afternoon.
>>
> Unless you happened to be walking by my garage, probably not super likely - I
> spent the day working on scooters/motorcycles (my style is a little less
> unique in SF :)). Also, if you see me, feel free to say hi unless I look like
> I haven't had my first coffee of the day - love chatting with folks IRL :)
>
>>
>> @chris, we're really interested in the Meetup you're hosting. My team
>> will probably join it from the beginning if you have room for us, and I'll
>> join it later after discussing the topics on this thread. I'll send you an
>> email regarding this request.
>>
>> Thanks
>>
>> El vie., 1 de jun. de 2018 7:26 AM, Saikat Kanjilal 
>> escribió:
>>
>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>> folks
>>>
>>> @Felix I work in downtown Seattle, and am wondering if we should host a tech
>>> meetup around model serving in Spark at my work or somewhere else close by,
>>> thoughts?  I’m actually in the midst of building microservices to manage
>>> models and when I say models I mean much more than machine learning models
>>> (think OR, process models as well)
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 31, 2018, at 10:32 PM, Chris Fregly  wrote:
>>>
>>> Hey everyone!
>>>
>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>> calendar event - mostly for me, so i don’t forget!  :)
>>>
>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>> TensorFlow Meetup*
>>> 
>>>  @5:30pm
>>> on June 6th (same night) here in SF!
>>>
>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>> includes the signup link:
>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>>> 
>>>
>>> We have an awesome lineup of speakers covering a lot of deep, technical
>>> ground.
>>>
>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>> posting the recording afterward.
>>>
>>> All details are in the meetup link above…
>>>
>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>> welcome to give a talk. I can move things around to make room.
>>>
>>> @joseph:  I’d personally like an update on the direction of the
>>> Databricks proprietary ML Serving export format which is similar to PMML
>>> but not a standard in any way.
>>>
>>> Also, the Databricks ML Serving Runtime is only available to Databricks
>>> customers.  This seems in conflict with the community efforts described
>>> here.  Can you comment on behalf of Databricks?
>>>
>>> Look forward to your response, joseph.
>>>
>>> See you all soon!
>>>
>>> —
>>>
>>>
>>> *Chris Fregly *Founder @ *PipelineAI*  (100,000
>>> Users)
>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>>  (85,000
>>> Global Members)
>>>
>>>
>>>
>>> *San Francisco - Chicago - Austin -  Washington DC - London - Dusseldorf
>>> *
>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>> *
>>>
>>>
>>> On May 30, 2018, at 9:32 AM, Felix Cheung 
>>> wrote:
>>>
>>> Hi!
>>>
>>> Thank you! Let’s meet then
>>>
>>> June 6 4pm
>>>
>>> Moscone West Convention Center
>>> 800 Howard Street, San Francisco, CA 94103
>>> 
>>>
>>> Ground floor (outside of conference area - should be available for all)
>>> - we will meet and decide where to go
>>>
>>> (Would not send invite because that would be too much noise for dev@)
>>>
>>> To paraphrase Joseph, we will use this to kick off the discussion and
>>> post notes afterwards and follow up online. As for Seattle, I would be very
>>> interested to meet in person later on and discuss ;)
>>>
>>>
>>> _
>>> From: Saikat Kanjilal 
>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>> Subject: Re: Revisiting Online serving of Spark models?
>>> To: Maximiliano Felice 
>>> Cc: Felix Cheung , Holden Karau <
>>> hol...@pigscanfly.ca>, Joseph Bradley , Leif
>>> Walsh , dev 
>>>
>>>
>>> Would love to join but am in Seattle, thoughts on how to make this work?
>>>
>>> Regards
>>>
>>> Sent from my iPhone
>>>
>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>>> maximilianofel...@gmail.com> wrote:
>>>
>>> Big +1 to a meeting with fresh air.
>>>
>>> Could anyone send the invites? I don't really know which is the place
>>> Holden is talking about.
>>>
>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung :
>>>
 You had me at blue bottle!


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-03 Thread Nick Pentreath
Congratulations!

On Tue, 3 Apr 2018 at 05:34 wangzhenhua (G)  wrote:

>
>
> Thanks everyone! It’s my great pleasure to be part of such a professional
> and innovative community!
>
>
>
>
>
> best regards,
>
> -Zhenhua(Xander)
>
>
>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-27 Thread Nick Pentreath
+1 (binding)

Built and ran Scala tests with "-Phadoop-2.6 -Pyarn -Phive", all passed.

Python tests passed (also including pyspark-streaming w/kafka-0.8 and flume
packages built)

On Tue, 27 Feb 2018 at 10:09 Felix Cheung  wrote:

> +1
>
> Tested R:
>
> install from package, CRAN tests, manual tests, help check, vignettes check
>
> Filed this https://issues.apache.org/jira/browse/SPARK-23461
> This is not a regression so not a blocker of the release.
>
> Tested this on win-builder and r-hub. On r-hub on multiple platforms
> everything passed. For win-builder, tests failed on x86 but passed on x64 -
> perhaps due to an intermittent download issue causing a gzip error,
> re-testing now but won’t hold the release on this.
>
> --
> *From:* Nan Zhu 
> *Sent:* Monday, February 26, 2018 4:03:22 PM
> *To:* Michael Armbrust
> *Cc:* dev
> *Subject:* Re: [VOTE] Spark 2.3.0 (RC5)
>
> +1  (non-binding), tested with internal workloads and benchmarks
>
> On Mon, Feb 26, 2018 at 12:09 PM, Michael Armbrust  > wrote:
>
>> +1 all our pipelines have been running the RC for several days now.
>>
>> On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun 
>> wrote:
>>
>>> +1 (non-binding).
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue 
>>> wrote:
>>>
 +1 (non-binding)

 On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li  wrote:

> +1 (binding) in Spark SQL, Core and PySpark.
>
> Xiao
>
> 2018-02-24 14:49 GMT-08:00 Ricardo Almeida <
> ricardo.alme...@actnowib.com>:
>
>> +1 (non-binding)
>>
>> same as previous RC
>>
>> On 24 February 2018 at 11:10, Hyukjin Kwon 
>> wrote:
>>
>>> +1
>>>
>>> 2018-02-24 16:57 GMT+09:00 Bryan Cutler :
>>>
 +1
 Tests passed and additionally ran Arrow related tests and did some
 perf checks with python 2.7.14

 On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau  wrote:

> Note: given the state of Jenkins I'd love to see Bryan Cutler or
> someone with Arrow experience sign off on this release.
>
> On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian  > wrote:
>
>> +1 (binding)
>>
>> Passed all the tests, looks good.
>>
>> Cheng
>>
>> On 2/23/18 15:00, Holden Karau wrote:
>>
>> +1 (binding)
>> PySpark artifacts install in a fresh Py3 virtual env
>>
>> On Feb 23, 2018 7:55 AM, "Denny Lee" 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
>>> joshgoldsboroughs...@gmail.com> wrote:
>>>
 New to testing out Spark RCs for the community but I was able
 to run some of the basic unit tests without error so for what it's 
 worth,
 I'm a +1.

 On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal <
 samee...@apache.org> wrote:

> Please vote on releasing the following candidate as Apache
> Spark version 2.3.0. The vote is open until Tuesday February 27, 
> 2018 at
> 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 
> votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see
> https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc5:
> https://github.com/apache/spark/tree/v2.3.0-rc5
> (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>
> List of JIRA tickets resolved in this release can be found
> here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be
> found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1266/
>
> The documentation corresponding to this release can be found
> at:
>
> 

Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-14 Thread Nick Pentreath
-1 for me as we elevated https://issues.apache.org/jira/browse/SPARK-23377 to
a Blocker. It should be fixed before release.

On Thu, 15 Feb 2018 at 07:25 Holden Karau  wrote:

> If this is a blocker in your view then the vote thread is an important
> place to mention it. I'm not super sure of all the places these methods are
> used, so I'll defer to srowen and folks, but for the ML-related implications
> in the past we've allowed people to set the hashing function when we've
> introduced changes.
>
> On Feb 15, 2018 2:08 PM, "mrkm4ntr"  wrote:
>
>> I was advised to post here in the discussion at GitHub. I do not know
>> what to do about the problem of the discussion being dispersed across two places.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: redundant decision tree model

2018-02-13 Thread Nick Pentreath
There is a long outstanding JIRA issue about it:
https://issues.apache.org/jira/browse/SPARK-3155.

It is probably still a useful feature to have for trees but the priority is
not that high since it may not be that useful for the tree ensemble models.

On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hello community,
> I have recently manually inspected some decision trees computed with Spark
> (2.2.1, but the behavior is the same with the latest code on the repo).
>
> I have observed that the trees are always complete, even if an entire
> subtree leads to the same prediction in its different leaves.
>
> In such case, the root of the subtree, instead of being an InternalNode,
> could simply be a LeafNode with the (shared) prediction.
>
> I know that decision trees computed by scikit-learn exhibit the same
> behavior; I understand that it arises by construction, because you only
> realize the redundancy at the end.
>
> So my question is, why is this "post-pruning" missing?
>
> Three hypotheses:
>
> 1) It is not suitable (for a reason I fail to see)
> 2) Such an addition to the code is considered not worth it (in terms of code
> complexity, maybe)
> 3) It has been overlooked, but could be a favorable addition
>
> For clarity, I have managed to isolate a small case to reproduce this, in
> what follows.
>
> This is the dataset:
>
>> +-+-+
>> |label|features |
>> +-+-+
>> |1.0  |[1.0,0.0,1.0]|
>> |1.0  |[0.0,1.0,0.0]|
>> |1.0  |[1.0,1.0,0.0]|
>> |0.0  |[0.0,0.0,0.0]|
>> |1.0  |[1.0,1.0,0.0]|
>> |0.0  |[0.0,1.0,1.0]|
>> |1.0  |[0.0,0.0,0.0]|
>> |0.0  |[0.0,1.0,1.0]|
>> |1.0  |[0.0,1.0,1.0]|
>> |0.0  |[1.0,0.0,0.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> |1.0  |[0.0,1.0,1.0]|
>> |0.0  |[0.0,0.0,1.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> |0.0  |[0.0,0.0,1.0]|
>> |0.0  |[1.0,1.0,1.0]|
>> |0.0  |[1.0,1.0,0.0]|
>> |1.0  |[1.0,1.0,1.0]|
>> |0.0  |[1.0,0.0,1.0]|
>> +-+-+
>
>
> Which generates the following model:
>
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15
>> nodes
>>   If (feature 1 <= 0.5)
>>If (feature 2 <= 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>>Else (feature 2 > 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>>   Else (feature 1 > 0.5)
>>If (feature 2 <= 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 1.0
>> Else (feature 0 > 0.5)
>>  Predict: 1.0
>>Else (feature 2 > 0.5)
>> If (feature 0 <= 0.5)
>>  Predict: 0.0
>> Else (feature 0 > 0.5)
>>  Predict: 0.0
>
>
> As you can see, the following model would be equivalent, but smaller and simpler:
>
> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15
>> nodes
>>   If (feature 1 <= 0.5)
>>Predict: 0.0
>>   Else (feature 1 > 0.5)
>>If (feature 2 <= 0.5)
>> Predict: 1.0
>>Else (feature 2 > 0.5)
>> Predict: 0.0
>
>
> This happens pretty often in real cases, and despite the small gain in the
> single model invocation for the "optimized" version, it can become non
> negligible when the number of calls is massive, as one can expect in a Big
> Data context.
>
> I would appreciate your opinion on this matter (if relevant for a PR or
> not, pros/cons etc).
>
> Best regards,
> Alessandro
>
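
For illustration, the redundancy described in this thread can be detected by walking a
fitted model's nodes via the public org.apache.spark.ml.tree classes. The following is
only a rough sketch, not an existing Spark feature or a proposed API: it assumes a
fitted DecisionTreeClassificationModel bound to a hypothetical name `model`, and counts
the internal nodes whose entire subtree yields a single prediction - exactly the
subtrees the suggested post-pruning would collapse into one leaf.

import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Distinct leaf predictions reachable from a (sub)tree root.
def leafPredictions(node: Node): Set[Double] = node match {
  case leaf: LeafNode         => Set(leaf.prediction)
  case internal: InternalNode =>
    leafPredictions(internal.leftChild) ++ leafPredictions(internal.rightChild)
}

// Internal nodes whose whole subtree predicts a single class: these are the
// candidates a post-pruning pass could replace with a single leaf.
def redundantSubtrees(node: Node): Int = node match {
  case _: LeafNode            => 0
  case internal: InternalNode =>
    val here = if (leafPredictions(internal).size == 1) 1 else 0
    here +
      redundantSubtrees(internal.leftChild) +
      redundantSubtrees(internal.rightChild)
}

// Usage, assuming `model` is a fitted DecisionTreeClassificationModel:
// println(s"redundant subtrees: ${redundantSubtrees(model.rootNode)}")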


Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Nick Pentreath
All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side
that should be everything outstanding.

On Thu, 1 Feb 2018 at 06:21 Yin Huai  wrote:

> seems we are not running tests related to pandas in pyspark tests (see my
> email "python tests related to pandas are skipped in jenkins"). I think we
> should fix this test issue and make sure all tests are good before cutting
> RC3.
>
> On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal 
> wrote:
>
>> Just a quick status update on RC3 -- SPARK-23274
>>  was resolved
>> yesterday and tests have been quite healthy throughout this week and the
>> last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202
>> ) is resolved.
>>
>>
>> On 30 January 2018 at 10:12, Andrew Ash  wrote:
>>
>>> I'd like to nominate SPARK-23274
>>>  as a potential
>>> blocker for the 2.3.0 release as well, due to being a regression from
>>> 2.2.0.  The ticket has a simple repro included, showing a query that works
>>> in prior releases but now fails with an exception in the catalyst optimizer.
>>>
>>> On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal 
>>> wrote:
>>>
 This vote has failed due to a number of aforementioned blockers. I'll
 follow up with RC3 as soon as the 2 remaining (non-QA) blockers are
 resolved: https://s.apache.org/oXKi


 On 25 January 2018 at 12:59, Sameer Agarwal 
 wrote:

>
> Most tests pass on RC2, except I'm still seeing the timeout caused by
>> https://issues.apache.org/jira/browse/SPARK-23055 ; the tests never
>> finish. I followed the thread a bit further and wasn't clear whether it 
>> was
>> subsequently re-fixed for 2.3.0 or not. It says it's resolved along with
>> https://issues.apache.org/jira/browse/SPARK-22908 for 2.3.0 though I
>> am still seeing these tests fail or hang:
>>
>> - subscribing topic by name from earliest offsets (failOnDataLoss:
>> false)
>> - subscribing topic by name from earliest offsets (failOnDataLoss:
>> true)
>>
>
> Sean, while some of these tests were timing out on RC1, we're not
> aware of any known issues in RC2. Both maven (
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/146/testReport/org.apache.spark.sql.kafka010/history/)
> and sbt (
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/123/testReport/org.apache.spark.sql.kafka010/history/)
> historical builds on jenkins for org.apache.spark.sql.kafka010 look fairly
> healthy. If you're still seeing timeouts in RC2, can you create a JIRA 
> with
> any applicable build/env info?
>
>
>
>> On Tue, Jan 23, 2018 at 9:01 AM Sean Owen  wrote:
>>
>>> I'm not seeing that same problem on OS X and /usr/bin/tar. I tried
>>> unpacking it with 'xvzf' and also unzipping it first, and it untarred
>>> without warnings in either case.
>>>
>>> I am encountering errors while running the tests, different ones
>>> each time, so am still figuring out whether there is a real problem or 
>>> just
>>> flaky tests.
>>>
>>> These issues look like blockers, as they are inherently to be
>>> completed before the 2.3 release. They are mostly not done. I suppose 
>>> I'd
>>> -1 on behalf of those who say this needs to be done first, though, we 
>>> can
>>> keep testing.
>>>
>>> SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella
>>> SPARK-23114 Spark R 2.3 QA umbrella
>>>
>>> Here are the remaining items targeted for 2.3:
>>>
>>> SPARK-15689 Data source API v2
>>> SPARK-20928 SPIP: Continuous Processing Mode for Structured Streaming
>>> SPARK-21646 Add new type coercion rules to compatible with Hive
>>> SPARK-22386 Data Source V2 improvements
>>> SPARK-22731 Add a test for ROWID type to OracleIntegrationSuite
>>> SPARK-22735 Add VectorSizeHint to ML features documentation
>>> SPARK-22739 Additional Expression Support for Objects
>>> SPARK-22809 pyspark is sensitive to imports with dots
>>> SPARK-22820 Spark 2.3 SQL API audit
>>>
>>>
>>> On Mon, Jan 22, 2018 at 7:09 PM Marcelo Vanzin 
>>> wrote:
>>>
 +0

 Signatures check out. Code compiles, although I see the errors in
 [1]
 when untarring the source archive; perhaps we should add "use GNU
 tar"
 to the RM checklist?

 Also ran our internal tests and they seem happy.

 My concern is the list of open bugs targeted at 2.3.0 (ignoring 

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Nick Pentreath
I think this has come up before (and Sean mentions it above), but the
sub-items on:

SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella

are actually marked as Blockers, but are not targeted to 2.3.0. I think
they should be, and I'm not comfortable with those not being resolved
before voting positively on the release.

So I'm -1 too for that reason.

I think most of those review items are close to done, and there is also
https://issues.apache.org/jira/browse/SPARK-22799 that I think should be in
for 2.3 (to avoid a behavior change later between 2.3.0 and 2.3.1,
especially since we'll have another RC now it seems).


On Thu, 25 Jan 2018 at 19:28 Marcelo Vanzin  wrote:

> Sorry, have to change my vote again. Hive guys ran into SPARK-23209
> and that's a regression we need to fix. I'll post a patch soon. So -1
> (although others have already -1'ed).
>
> On Wed, Jan 24, 2018 at 11:42 AM, Marcelo Vanzin 
> wrote:
> > Given that the bugs I was worried about have been dealt with, I'm
> > upgrading to +1.
> >
> > On Mon, Jan 22, 2018 at 5:09 PM, Marcelo Vanzin 
> wrote:
> >> +0
> >>
> >> Signatures check out. Code compiles, although I see the errors in [1]
> >> when untarring the source archive; perhaps we should add "use GNU tar"
> >> to the RM checklist?
> >>
> >> Also ran our internal tests and they seem happy.
> >>
> >> My concern is the list of open bugs targeted at 2.3.0 (ignoring the
> >> documentation ones). It is not long, but it seems some of those need
> >> to be looked at. It would be nice for the committers who are involved
> >> in those bugs to take a look.
> >>
> >> [1]
> https://superuser.com/questions/318809/linux-os-x-tar-incompatibility-tarballs-created-on-os-x-give-errors-when-unt
> >>
> >>
> >> On Mon, Jan 22, 2018 at 1:36 PM, Sameer Agarwal 
> wrote:
> >>> Please vote on releasing the following candidate as Apache Spark
> version
> >>> 2.3.0. The vote is open until Friday January 26, 2018 at 8:00:00 am
> UTC and
> >>> passes if a majority of at least 3 PMC +1 votes are cast.
> >>>
> >>>
> >>> [ ] +1 Release this package as Apache Spark 2.3.0
> >>>
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>>
> >>> To learn more about Apache Spark, please see https://spark.apache.org/
> >>>
> >>> The tag to be voted on is v2.3.0-rc2:
> >>> https://github.com/apache/spark/tree/v2.3.0-rc2
> >>> (489ecb0ef23e5d9b705e5e5bae4fa3d871bdac91)
> >>>
> >>> List of JIRA tickets resolved in this release can be found here:
> >>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-bin/
> >>>
> >>> Release artifacts are signed with the following key:
> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>>
> >>> The staging repository for this release can be found at:
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1262/
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>>
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc2-docs/_site/index.html
> >>>
> >>>
> >>> FAQ
> >>>
> >>> ===
> >>> What are the unresolved issues targeted for 2.3.0?
> >>> ===
> >>>
> >>> Please see https://s.apache.org/oXKi. At the time of writing, there
> are
> >>> currently no known release blockers.
> >>>
> >>> =
> >>> How can I help test this release?
> >>> =
> >>>
> >>> If you are a Spark user, you can help us test this release by taking an
> >>> existing Spark workload and running on this release candidate, then
> >>> reporting any regressions.
> >>>
> >>> If you're working in PySpark you can set up a virtual env and install
> the
> >>> current RC and see if anything important breaks, in the Java/Scala you
> can
> >>> add the staging repository to your projects resolvers and test with
> the RC
> >>> (make sure to clean up the artifact cache before/after so you don't
> end up
> >>> building with an out of date RC going forward).
> >>>
> >>> ===
> >>> What should happen to JIRA tickets still targeting 2.3.0?
> >>> ===
> >>>
> >>> Committers should look at those and triage. Extremely important bug
> fixes,
> >>> documentation, and API tweaks that impact compatibility should be
> worked on
> >>> immediately. Everything else please retarget to 2.3.1 or 2.3.0 as
> >>> appropriate.
> >>>
> >>> ===
> >>> Why is my bug not fixed?
> >>> ===
> >>>
> >>> In order to make timely releases, we will typically not hold the
> release
> >>> unless the bug in question is a regression from 2.2.0. That being
> said, if
> >>> there is something which is a regression from 2.2.0 and has not been
> >>> correctly targeted 

Re: CrossValidation distribution - is it in the roadmap?

2017-11-29 Thread Nick Pentreath
Hi Tomasz,

Parallel evaluation for CrossValidator and TrainValidationSplit was added
in Spark 2.3 via https://issues.apache.org/jira/browse/SPARK-19357
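
To illustrate what that looks like, here is a rough sketch against the Spark 2.3 ML
tuning API: setting the parallelism parameter (added by SPARK-19357) makes
CrossValidator fit up to that many models concurrently from the driver, while each
individual fit still runs as a normal distributed Spark job. The `trainingDF` below is
a hypothetical DataFrame with the default "label" and "features" columns.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

// Small grid for illustration; in practice it could hold hundreds of combinations.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(4) // evaluate up to 4 models at a time (Spark 2.3+)

// val cvModel = cv.fit(trainingDF) // trainingDF: DataFrame with "label"/"features"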


On Wed, 29 Nov 2017 at 16:31 Tomasz Dudek 
wrote:

> Hey,
>
> is there a way to make the following code:
>
> val paramGrid = new ParamGridBuilder().//omitted for brevity - let's say we
> have hundreds of param combinations here
>
> val cv = new
> CrossValidator().setNumFolds(3).setEstimator(pipeline).setEstimatorParamMaps(paramGrid)
>
> automatically distribute itself over all the executors? What I mean is
> to simultaneously compute a few (or hundreds of) ML models, instead of
> using all the computation power on just one model at a time.
>
> If not, is such behavior in the Spark's road map?
>
> ...if not, do you think a person without prior Spark development
> experience (me) could do it? I've been using Spark ML daily for a few months at
> work. How much time would it take, approximately?
>
> Yours,
> Tomasz
>
>
>


Re: Timeline for Spark 2.3

2017-11-09 Thread Nick Pentreath
+1 I think that’s practical

On Fri, 10 Nov 2017 at 03:13, Erik Erlandson  wrote:

> +1 on extending the deadline. It will significantly improve the logistics
> for upstreaming the Kubernetes back-end. Also agreed on the general
> realities of reduced bandwidth over the Nov-Dec holiday season.
> Erik
>
> On Thu, Nov 9, 2017 at 6:03 PM, Matei Zaharia 
> wrote:
>
>> I’m also +1 on extending this to get Kubernetes and other features in.
>>
>> Matei
>>
>> > On Nov 9, 2017, at 4:04 PM, Anirudh Ramanathan
>>  wrote:
>> >
>> > This would help the community on the Kubernetes effort quite a bit -
>> giving us additional time for reviews and testing for the 2.3 release.
>> >
>> > On Thu, Nov 9, 2017 at 3:56 PM, Justin Miller <
>> justin.mil...@protectwise.com> wrote:
>> > That sounds fine to me. I’m hoping that this ticket can make it into
>> Spark 2.3: https://issues.apache.org/jira/browse/SPARK-18016
>> >
>> > It’s causing some pretty considerable problems when we alter the
>> columns to be nullable, but we are OK for now without that.
>> >
>> > Best,
>> > Justin
>> >
>> >> On Nov 9, 2017, at 4:54 PM, Michael Armbrust 
>> wrote:
>> >>
>> >> According to the timeline posted on the website, we are nearing branch
>> cut for Spark 2.3.  I'd like to propose pushing this out towards mid to
>> late December for a couple of reasons and would like to hear what people
>> think.
>> >>
>> >> 1. I've done release management during the Thanksgiving / Christmas
>> time before and in my experience, we don't actually get a lot of testing
>> during this time due to vacations and other commitments. I think beginning
>> the RC process in early January would give us the best coverage in the
>> shortest amount of time.
>> >> 2. There are several large initiatives in progress that given a little
>> more time would leave us with a much more exciting 2.3 release.
>> Specifically, the work on the history server, Kubernetes and continuous
>> processing.
>> >> 3. Given the actual release date of Spark 2.2, I think we'll still get
>> Spark 2.3 out roughly 6 months after.
>> >>
>> >> Thoughts?
>> >>
>> >> Michael
>> >
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Ah yes - I recall that it was fixed. Forgot it was for 2.3.0

My +1 vote stands.

On Fri, 6 Oct 2017 at 15:15 Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi Nick,
>
> I believe that R test failure is due to SPARK-21093 - at least the error
> message looks the same - and that is fixed as of 2.3.0. This was not
> backported because I and the reviewers were worried, as that fix touched a very
> core part of SparkR (it was even reverted once, even after a very close look by
> some reviewers).
>
> I asked Michael to note this as a known issue in
> https://spark.apache.org/releases/spark-release-2-2-0.html#known-issues
> before, for this reason.
> I believe it should be fine, and we should probably note it if possible. I
> believe this should not be a regression anyway as, if I understood
> correctly, it has been there from the very beginning.
>
> Thanks.
>
>
>
>
> 2017-10-06 21:20 GMT+09:00 Nick Pentreath <nick.pentre...@gmail.com>:
>
>> Checked sigs & hashes.
>>
>> Tested on RHEL
>> build/mvn -Phadoop-2.7 -Phive -Pyarn test passed
>> Python tests passed
>>
>> I ran R tests and am getting some failures:
>> https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem
>> to recall similar issues on a previous release but I thought it was fixed).
>>
>> I re-ran R tests on an Ubuntu box to double check and they passed there.
>>
>> So I'd still +1 the release
>>
>> Perhaps someone can take a look at the R failures on RHEL just in case
>> though.
>>
>>
>> On Fri, 6 Oct 2017 at 05:58 vaquar khan <vaquar.k...@gmail.com> wrote:
>>
>>> +1 (non-binding), tested on Ubuntu; all test cases passed.
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>>
>>>> +1 too.
>>>>
>>>>
>>>> On 6 Oct 2017 10:49 am, "Reynold Xin" <r...@databricks.com> wrote:
>>>>
>>>> +1
>>>>
>>>>
>>>> On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.1.2-rc4
>>>>> <https://github.com/apache/spark/tree/v2.1.2-rc4> (
>>>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>>>
>>>>> List of JIRA tickets resolved in this release can be found with this
>>>>> filter.
>>>>> <https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.2>
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>>>
>>>>> Release artifacts are signed with a key from:
>>>>> https://people.apache.org/~holden/holdens_keys.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks, in the
>>>>> Java/Scala you can add the staging repository to your projects resolvers
>>>>> and test with the RC (make sure to clean up the artifact cache
>>>>> before/after so you don't end up building with an out of date RC going

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Checked sigs & hashes.

Tested on RHEL
build/mvn -Phadoop-2.7 -Phive -Pyarn test passed
Python tests passed

I ran R tests and am getting some failures:
https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to
recall similar issues on a previous release but I thought it was fixed).

I re-ran R tests on an Ubuntu box to double check and they passed there.

So I'd still +1 the release

Perhaps someone can take a look at the R failures on RHEL just in case
though.


On Fri, 6 Oct 2017 at 05:58 vaquar khan  wrote:

> +1 (non-binding), tested on Ubuntu; all test cases passed.
>
> Regards,
> Vaquar khan
>
> On Thu, Oct 5, 2017 at 10:46 PM, Hyukjin Kwon  wrote:
>
>> +1 too.
>>
>>
>> On 6 Oct 2017 10:49 am, "Reynold Xin"  wrote:
>>
>> +1
>>
>>
>> On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc4
>>>  (
>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with an out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said
>>> if there is something which is a regression from 2.1.1 that has not
>>> been correctly targeted please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in awhile not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress), however
>>> the chances of a mistake are higher with any change like this. If there
>>> something you normally take for granted as correct when checking a release,
>>> please double check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *What happened to 

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-04 Thread Nick Pentreath
Ah right! Was using a new cloud instance and didn't realize I was logged in
as root! Thanks.

On Tue, 3 Oct 2017 at 21:13 Marcelo Vanzin <van...@cloudera.com> wrote:

> Maybe you're running as root (or the admin account on your OS)?
>
> On Tue, Oct 3, 2017 at 12:12 PM, Nick Pentreath
> <nick.pentre...@gmail.com> wrote:
> > Hmm I'm consistently getting this error in core tests:
> >
> > - SPARK-3697: ignore directories that cannot be read. *** FAILED ***
> >   2 was not equal to 1 (FsHistoryProviderSuite.scala:146)
> >
> >
> > Anyone else? Any insight? Perhaps it's my set up.
> >
> >>>
> >>>
> >>> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau <hol...@pigscanfly.ca>
> wrote:
> >>>>
> >>>> Please vote on releasing the following candidate as Apache Spark
> version
> >>>> 2.1.2. The vote is open until Saturday October 7th at 9:00 PST and
> passes if
> >>>> a majority of at least 3 +1 PMC votes are cast.
> >>>>
> >>>> [ ] +1 Release this package as Apache Spark 2.1.2
> >>>> [ ] -1 Do not release this package because ...
> >>>>
> >>>>
> >>>> To learn more about Apache Spark, please see
> https://spark.apache.org/
> >>>>
> >>>> The tag to be voted on is v2.1.2-rc4
> >>>> (2abaea9e40fce81cd4626498e0f5c28a70917499)
> >>>>
> >>>> List of JIRA tickets resolved in this release can be found with this
> >>>> filter.
> >>>>
> >>>> The release files, including signatures, digests, etc. can be found
> at:
> >>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
> >>>>
> >>>> Release artifacts are signed with a key from:
> >>>> https://people.apache.org/~holden/holdens_keys.asc
> >>>>
> >>>> The staging repository for this release can be found at:
> >>>>
> https://repository.apache.org/content/repositories/orgapachespark-1252
> >>>>
> >>>> The documentation corresponding to this release can be found at:
> >>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
> >>>>
> >>>>
> >>>> FAQ
> >>>>
> >>>> How can I help test this release?
> >>>>
> >>>> If you are a Spark user, you can help us test this release by taking
> an
> >>>> existing Spark workload and running on this release candidate, then
> >>>> reporting any regressions.
> >>>>
> >>>> If you're working in PySpark you can set up a virtual env and install
> >>>> the current RC and see if anything important breaks, in the
> Java/Scala you
> >>>> can add the staging repository to your projects resolvers and test
> with the
> >>>> RC (make sure to clean up the artifact cache before/after so you
> don't end
> >>>> up building with an out of date RC going forward).
> >>>>
> >>>> What should happen to JIRA tickets still targeting 2.1.2?
> >>>>
> >>>> Committers should look at those and triage. Extremely important bug
> >>>> fixes, documentation, and API tweaks that impact compatibility should
> be
> >>>> worked on immediately. Everything else please retarget to 2.1.3.
> >>>>
> >>>> But my bug isn't fixed!??!
> >>>>
> >>>> In order to make timely releases, we will typically not hold the
> release
> >>>> unless the bug in question is a regression from 2.1.1. That being
> said if
> >>>> there is something which is a regression from 2.1.1 that has not been
> >>>> correctly targeted please ping a committer to help target the issue
> (you can
> >>>> see the open issues listed as impacting Spark 2.1.1 & 2.1.2)
> >>>>
> >>>> What are the unresolved issues targeted for 2.1.2?
> >>>>
> >>>> At this time there are no open unresolved issues.
> >>>>
> >>>> Is there anything different about this release?
> >>>>
> >>>> This is the first release in awhile not built on the AMPLAB Jenkins.
> >>>> This is good because it means future releases can more easily be
> built and
> >>>> signed securely (and I've been updating the documentation in
> >>>> https://github.com/apache/spark-website/pull/66 as I progress),
> however the
> >>>> chances of a mistake are higher with any change like this. If there
> >>>> something you normally take for granted as correct when checking a
> release,
> >>>> please double check this time :)
> >>>>
> >>>> Should I be committing code to branch-2.1?
> >>>>
> >>>> Thanks for asking! Please treat this stage in the RC process as "code
> >>>> freeze" so bug fixes only. If you're uncertain if something should be
> back
> >>>> ported please reach out. If you do commit to branch-2.1 please tag
> your JIRA
> >>>> issue fix version for 2.1.3 and if we cut another RC I'll move the
> 2.1.3
> >>>> fixed into 2.1.2 as appropriate.
> >>>>
> >>>> What happened to RC3?
> >>>>
> >>>> Some R+zinc interactions kept it from getting out the door.
> >>>> --
> >>>> Twitter: https://twitter.com/holdenkarau
> >>
> >>
> >
>
>
>
> --
> Marcelo
>


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Nick Pentreath
Hmm I'm consistently getting this error in core tests:

- SPARK-3697: ignore directories that cannot be read. *** FAILED ***
  2 was not equal to 1 (FsHistoryProviderSuite.scala:146)


Anyone else? Any insight? Perhaps it's my set up.


>>
>> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.2. The vote is open until Saturday October 7th at 9:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.2-rc4
>>>  (
>>> 2abaea9e40fce81cd4626498e0f5c28a70917499)
>>>
>>> List of JIRA tickets resolved in this release can be found with this
>>> filter.
>>> 
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>>
>>> Release artifacts are signed with a key from:
>>> https://people.apache.org/~holden/holdens_keys.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test with
>>> the RC (make sure to clean up the artifact cache before/after so you
>>> don't end up building with an out of date RC going forward).
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.3.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1. That being said
>>> if there is something which is a regression from 2.1.1 that has not
>>> been correctly targeted please ping a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>>> 
>>> )
>>>
>>> *What are the unresolved* issues targeted for 2.1.2
>>> 
>>> ?
>>>
>>> At this time there are no open unresolved issues.
>>>
>>> *Is there anything different about this release?*
>>>
>>> This is the first release in awhile not built on the AMPLAB Jenkins.
>>> This is good because it means future releases can more easily be built and
>>> signed securely (and I've been updating the documentation in
>>> https://github.com/apache/spark-website/pull/66 as I progress), however
>>> the chances of a mistake are higher with any change like this. If there
>>> something you normally take for granted as correct when checking a release,
>>> please double check this time :)
>>>
>>> *Should I be committing code to branch-2.1?*
>>>
>>> Thanks for asking! Please treat this stage in the RC process as "code
>>> freeze" so bug fixes only. If you're uncertain if something should be back
>>> ported please reach out. If you do commit to branch-2.1 please tag your
>>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the
>>> 2.1.3 fixed into 2.1.2 as appropriate.
>>>
>>> *What happened to RC3?*
>>>
>>> Some R+zinc interactions kept it from getting out the door.
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>


Re: Should Flume integration be behind a profile?

2017-10-02 Thread Nick Pentreath
I'd agree with #1 or #2. Deprecation now seems fine.

Perhaps this should be raised on the user list also?

And perhaps it makes sense to look at moving the Flume support into Apache
Bahir if there is interest (I've cc'ed Bahir dev list here)? That way the
current state of the connector could keep going for those users who may
need it.

As for examples, for the Kinesis connector the examples now live in the
subproject (see e.g. KinesisWordCountASL under external/kinesis-asl). So we
don't have to completely remove the examples, just move them (this may not
solve the doc issue but at least the examples are still there for anyone
who needs them).

On Mon, 2 Oct 2017 at 06:36 Mridul Muralidharan  wrote:

> I agree, proposal 1 sounds better among the options.
>
> Regards,
> Mridul
>
>
> On Sun, Oct 1, 2017 at 3:50 PM, Reynold Xin  wrote:
> > Probably should do 1, and then it is an easier transition in 3.0.
> >
> > On Sun, Oct 1, 2017 at 1:28 AM Sean Owen  wrote:
> >>
> >> I tried and failed to do this in
> >> https://issues.apache.org/jira/browse/SPARK-22142 because it became
> clear
> >> that the Flume examples would have to be removed to make this work, too.
> >> (Well, you can imagine other solutions with extra source dirs or
> modules for
> >> flume examples enabled by a profile, but that doesn't help the docs and
> is
> >> nontrivial complexity for little gain.)
> >>
> >> It kind of suggests Flume support should be deprecated if it's put
> behind
> >> a profile. Like with Kafka 0.8. (This is why I'm raising it again to the
> >> whole list.)
> >>
> >> Any preferences among:
> >> 1. Put Flume behind a profile, remove examples, deprecate
> >> 2. Put Flume behind a profile, remove examples, but don't deprecate
> >> 3. Punt until Spark 3.0, when this integration would probably be removed
> >> entirely (?)
> >>
> >> On Tue, Sep 26, 2017 at 10:36 AM Sean Owen  wrote:
> >>>
> >>> Not a big deal, but I'm wondering whether Flume integration should at
> >>> least be opt-in and behind a profile? it still sees some use (at least
> on
> >>> our end) but not applicable to the majority of users. Most other
> third-party
> >>> framework integrations are behind a profile, like YARN, Mesos, Kinesis,
> >>> Kafka 0.8, Docker. Just soliciting comments, not arguing for it.
> >>>
> >>> (Well, actually it annoys me that the Flume integration always fails to
> >>> compile in IntelliJ unless you generate the sources manually)
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Nick Pentreath
Congratulations!



>>
>> Matei Zaharia wrote
>> > Hi all,
>> >
>> > The Spark PMC recently added Tejas Patil as a committer on the
>> > project. Tejas has been contributing across several areas of Spark for
>> > a while, focusing especially on scalability issues and SQL. Please
>> > join me in welcoming Tejas!
>> >
>> > Matei
>> >
>> > -
>> > To unsubscribe e-mail:
>>
>> > dev-unsubscribe@.apache
>>
>>
>>
>>
>>
>> -
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for
each release. I think generally we catch all breaking and most major
behavior changes.

On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun  wrote:

> +1
>
> On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li  wrote:
>
>> Hi, Devs,
>>
>> Many questions from the open source community are actually caused by the
>> behavior changes we made in each release. So far, the migration guides
>> (e.g.,
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide)
>> were not being properly updated. In the last few releases, multiple
>> behavior changes were not documented in the migration guides or even the
>> release notes. I propose doing the documentation updates in the same PRs that
>> introduce the behavior changes. If the contributors can't make it, the
>> committers who merge the PRs need to do it instead. We can also create a dedicated page
>> for migration guides of all the components. Hopefully, this can assist the
>> migration efforts.
>>
>> Thanks,
>>
>> Xiao Li
>>
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Nick Pentreath
+1 (binding)

On Mon, 3 Jul 2017 at 11:53 Yanbo Liang  wrote:

> +1
>
> On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> +1
>>
>> On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida <
>> ricardo.alme...@actnowib.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
>>> -Phive-thriftserver -Pscala-2.11 on
>>>
>>>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>>>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>>>
>>>
>>>
>>>
>>>
>>> On 1 Jul 2017 02:45, "Michael Armbrust"  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc6
>>>  (
>>> a2c7b2133cfee7fa9abfaa2bfbfb637155466783)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> 
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1245/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>>
>>>
>>
>>
>


Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Nick Pentreath
As before, the release looks good; all Scala and Python tests pass. R tests fail
with the same issue as in SPARK-21093, but it's not a blocker.

+1 (binding)


On Wed, 21 Jun 2017 at 01:49 Michael Armbrust 
wrote:

> I will kick off the voting with a +1.
>
> On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc5
>>  (
>> 62e442e73a2fa663892d2edaff5f7d72d7f402ed)
>>
>> List of JIRA tickets resolved can be found with this filter
>> 
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1243/
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Nick Pentreath
Thanks, I added the details of my environment to the JIRA (for what it's
worth now, as the issue is identified)

On Wed, 14 Jun 2017 at 11:28 Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Actually, I opened - https://issues.apache.org/jira/browse/SPARK-21093.
>
> 2017-06-14 17:08 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>:
>
>> For a shorter reproducer ...
>>
>>
>> df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
>> collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>>
>> And running the below multiple times (5~7):
>>
>> collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>>
>> looks occasionally throwing an error.
>>
>>
>> I will leave here and probably explain more information if a JIRA is
>> open. This does not look a regression anyway.
>>
>>
>>
>> 2017-06-14 16:22 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>:
>>
>>>
>>> Per https://github.com/apache/spark/tree/v2.1.1,
>>>
>>> 1. CentOS 7.2.1511 / R 3.3.3 - this test hangs.
>>>
>>> I messed it up a bit while downgrading R to 3.3.3 (it was an actual
>>> machine, not a VM), so it took me a while to re-try this.
>>> I re-built this again and checked that the R version is 3.3.3, at least. I
>>> hope this one could be double-checked.
>>>
>>> Here is the self-reproducer:
>>>
>>> irisDF <- suppressWarnings(createDataFrame (iris))
>>> schema <-  structType(structField("Sepal_Length", "double"),
>>> structField("Avg", "double"))
>>> df4 <- gapply(
>>>   cols = "Sepal_Length",
>>>   irisDF,
>>>   function(key, x) {
>>> y <- data.frame(key, mean(x$Sepal_Width), stringsAsFactors = FALSE)
>>>   },
>>>   schema)
>>> collect(df4)
>>>
>>>
>>>
>>> 2017-06-14 16:07 GMT+09:00 Felix Cheung <felixcheun...@hotmail.com>:
>>>
>>>> Thanks! Will try to setup RHEL/CentOS to test it out
>>>>
>>>> _
>>>> From: Nick Pentreath <nick.pentre...@gmail.com>
>>>> Sent: Tuesday, June 13, 2017 11:38 PM
>>>> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
>>>> To: Felix Cheung <felixcheun...@hotmail.com>, Hyukjin Kwon <
>>>> gurwls...@gmail.com>, dev <dev@spark.apache.org>
>>>>
>>>> Cc: Sean Owen <so...@cloudera.com>
>>>>
>>>>
>>>> Hi yeah sorry for slow response - I was RHEL and OpenJDK but will have
>>>> to report back later with the versions as am AFK.
>>>>
>>>> R version not totally sure but again will revert asap
>>>> On Wed, 14 Jun 2017 at 05:09, Felix Cheung <felixcheun...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Thanks
>>>>> This was with an external package and unrelated
>>>>>
>>>>>   >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
>>>>> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>>>>
>>>>> As for CentOS - would it be possible to test against R older than
>>>>> 3.4.0? This is the same error reported by Nick below.
>>>>>
>>>>> _
>>>>> From: Hyukjin Kwon <gurwls...@gmail.com>
>>>>> Sent: Tuesday, June 13, 2017 8:02 PM
>>>>>
>>>>> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
>>>>> To: dev <dev@spark.apache.org>
>>>>> Cc: Sean Owen <so...@cloudera.com>, Nick Pentreath <
>>>>> nick.pentre...@gmail.com>, Felix Cheung <felixcheun...@hotmail.com>
>>>>>
>>>>>
>>>>>
>>>>> For the test failure on R, I checked:
>>>>>
>>>>>
>>>>> Per https://github.com/apache/spark/tree/v2.2.0-rc4,
>>>>>
>>>>> 1. Windows Server 2012 R2 / R 3.3.1 - passed (
>>>>> https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4
>>>>> )
>>>>> 2. macOS Sierra 10.12.3 / R 3.4.0 - passed
>>>>> 3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
>>>>> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>>>>> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-14 Thread Nick Pentreath
Hi - yeah, sorry for the slow response. I was on RHEL and OpenJDK but will
have to report back later with the versions, as I am AFK.

Not totally sure of the R version either, but again will revert ASAP.
On Wed, 14 Jun 2017 at 05:09, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Thanks
> This was with an external package and unrelated
>
>   >> macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
>
> As for CentOS - would it be possible to test against R older than 3.4.0?
> This is the same error reported by Nick below.
>
> _
> From: Hyukjin Kwon <gurwls...@gmail.com>
> Sent: Tuesday, June 13, 2017 8:02 PM
>
> Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)
> To: dev <dev@spark.apache.org>
> Cc: Sean Owen <so...@cloudera.com>, Nick Pentreath <
> nick.pentre...@gmail.com>, Felix Cheung <felixcheun...@hotmail.com>
>
>
>
> For the test failure on R, I checked:
>
>
> Per https://github.com/apache/spark/tree/v2.2.0-rc4,
>
> 1. Windows Server 2012 R2 / R 3.3.1 - passed (
> https://ci.appveyor.com/project/spark-test/spark/build/755-r-test-v2.2.0-rc4
> )
> 2. macOS Sierra 10.12.3 / R 3.4.0 - passed
> 3. macOS Sierra 10.12.3 / R 3.2.3 - passed with a warning (
> https://gist.github.com/HyukjinKwon/85cbcfb245825852df20ed6a9ecfd845)
> 4. CentOS 7.2.1511 / R 3.4.0 - reproduced (
> https://gist.github.com/HyukjinKwon/2a736b9f80318618cc147ac2bb1a987d)
>
>
> Per https://github.com/apache/spark/tree/v2.1.1,
>
> 1. CentOS 7.2.1511 / R 3.4.0 - reproduced (
> https://gist.github.com/HyukjinKwon/6064b0d10bab8fc1dc6212452d83b301)
>
>
> This looks being failed only in CentOS 7.2.1511 / R 3.4.0 given my tests
> and observations.
>
> This is failed in Spark 2.1.1. So, it sounds not a regression although it
> is a bug that should be fixed (whether in Spark or R).
>
>
> 2017-06-14 8:28 GMT+09:00 Xiao Li <gatorsm...@gmail.com>:
>
>> -1
>>
>> Spark 2.2 is unable to read the partitioned table created by Spark 2.1 or
>> earlier.
>>
>> Opened a JIRA https://issues.apache.org/jira/browse/SPARK-21085
>>
>> Will fix it soon.
>>
>> Thanks,
>>
>> Xiao Li
>>
>>
>>
>> 2017-06-13 9:39 GMT-07:00 Joseph Bradley <jos...@databricks.com>:
>>
>>> Re: the QA JIRAs:
>>> Thanks for discussing them.  I still feel they are very helpful; I
>>> particularly notice not having to spend a solid 2-3 weeks of time QAing
>>> (unlike in earlier Spark releases).  One other point not mentioned above: I
>>> think they serve as a very helpful reminder/training for the community for
>>> rigor in development.  Since we instituted QA JIRAs, contributors have been
>>> a lot better about adding in docs early, rather than waiting until the end
>>> of the cycle (though I know this is drawing conclusions from correlations).
>>>
>>> I would vote in favor of the RC...but I'll wait to see about the
>>> reported failures.
>>>
>>> On Fri, Jun 9, 2017 at 3:30 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Different errors as in
>>>> https://issues.apache.org/jira/browse/SPARK-20520 but that's also
>>>> reporting R test failures.
>>>>
>>>> I went back and tried to run the R tests and they passed, at least on
>>>> Ubuntu 17 / R 3.3.
>>>>
>>>>
>>>> On Fri, Jun 9, 2017 at 9:12 AM Nick Pentreath <nick.pentre...@gmail.com>
>>>> wrote:
>>>>
>>>>> All Scala, Python tests pass. ML QA and doc issues are resolved (as
>>>>> well as R it seems).
>>>>>
>>>>> However, I'm seeing the following test failure on R consistently:
>>>>> https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72
>>>>>
>>>>>
>>>>> On Thu, 8 Jun 2017 at 08:48 Denny Lee <denny.g@gmail.com> wrote:
>>>>>
>>>>>> +1 non-binding
>>>>>>
>>>>>> Tested on macOS Sierra, Ubuntu 16.04
>>>>>> test suite includes various test cases including Spark SQL, ML,
>>>>>> GraphFrames, Structured Streaming
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 7, 2017 at 9:40 PM vaquar khan <vaquar.k...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 non-binding
>>>>>>>
>>>>>>> Regards,
>>>>>>> vaquar khan

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-09 Thread Nick Pentreath
All Scala, Python tests pass. ML QA and doc issues are resolved (as well as
R it seems).

However, I'm seeing the following test failure on R consistently:
https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72


On Thu, 8 Jun 2017 at 08:48 Denny Lee  wrote:

> +1 non-binding
>
> Tested on macOS Sierra, Ubuntu 16.04
> test suite includes various test cases including Spark SQL, ML,
> GraphFrames, Structured Streaming
>
>
> On Wed, Jun 7, 2017 at 9:40 PM vaquar khan  wrote:
>
>> +1 non-binding
>>
>> Regards,
>> vaquar khan
>>
>> On Jun 7, 2017 4:32 PM, "Ricardo Almeida" 
>> wrote:
>>
>> +1 (non-binding)
>>
>> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
>> -Phive-thriftserver -Pscala-2.11 on
>>
>>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>>
>>
>> On 5 June 2017 at 21:14, Michael Armbrust  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.2.0. The vote is open until Thurs, June 8th, 2017 at 12:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.2.0-rc4
>>>  (
>>> 377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)
>>>
>>> List of JIRA tickets resolved can be found with this filter
>>> 
>>> .
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1241/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.1.
>>>
>>
>>
>>


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
Now, on the subject of (ML) QA JIRAs.

From the ML side, I believe they are required (I think others such as
Joseph will agree and in fact have already said as much).

Most are marked as Blockers, though of those the Python API coverage one is
strictly speaking not a Blocker, as we will never hold the release for API
parity issues (unless of course there is some critical bug or missing piece,
but that really falls under the standard RC bug triage process).

I believe they are Blockers, since they involve auditing binary compat and
new public APIs, visibility issues, Java compat, etc. I think it's obvious
that an RC should not pass if these have not been checked.

I actually agree that docs and the user guide are absolutely part of the
release, and in fact are one of its more important pieces. Apart from the
issues Sean mentions, not treating these things as critical issues or even
blockers is what inevitably, over time, leads to the user guide being out of
date, missing important features, etc.

In practice for ML at least we definitely aim to have all the doc / guide
issues done before the final release.

Now, in terms of process, none of these QA issues really requires an RC; they
can all be carried out once the release branch is cut. Some of the issues,
like binary compat, are perhaps a bit more tricky, but they inevitably involve
manually checking through the MiMa exclusions that were added, to verify they
are OK, etc. - so again an actual RC is not required here.
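
(For reference, a minimal sketch of the kind of entry being audited, as it
would appear in project/MimaExcludes.scala - the excluded class and method
below are hypothetical, purely to show the shape of an exclusion a reviewer
has to check:)

import com.typesafe.tools.mima.core._

// Each entry silences one reported binary incompatibility; the reviewer's job
// is to confirm the change behind it was intentional and not user-facing.
val exampleExcludes = Seq(
  ProblemFilters.exclude[DirectMissingMethodProblem](
    "org.apache.spark.ml.feature.SomeTransformer.someRemovedMethod"),
  ProblemFilters.exclude[MissingClassProblem](
    "org.apache.spark.ml.feature.SomeRemovedHelper")
)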

So really the answer is to more aggressively burn down these QA issues the
moment the release branch has been cut. Again, I think this echoes what
Joseph has said in previous threads.



On Tue, 6 Jun 2017 at 10:16 Sean Owen  wrote:

> On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
>
>> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
>> knowing that they were unlikely to pass.  That said, I still think these
>> early RCs are valuable. I know several users that wanted to test new
>> features in 2.2 that have used them.  Now, if we would prefer to call them
>> preview or RC0 or something I'd be okay with that as well.
>>
>
> They are valuable, I only suggest it's better to note explicitly when
> there are blockers or must-do tasks that will fail the RC. It makes a big
> difference to whether one would like to +1.
>
> I meant more than just calling them something different. An early RC could
> be voted as a released 'preview' artifact, at the start of the notional QA
> period, with a lower bar to passing, and releasable with known issues. This
> encourages more testing. It also resolves the controversy about whether
> it's OK to include an RC in a product (separate thread).
>
>
> Regarding doc updates, I don't think it is a requirement that they be
>> voted on as part of the release.  Even if they are something version
>> specific.  I think we have regularly updated the website with documentation
>> that was merged after the release.
>>
>
> They're part of the source release too, as markdown, and should be voted
> on. I've never understood otherwise. Have we actually released docs and
> then later changed them, so that they don't match the release? I don't
> recall that, but I do recall updating the non-version-specific website.
>
> Aside from the oddity of having docs generated from x.y source not match
> docs published for x.y, you want the same protections for doc source that
> the project distributes as anything else. It's not just correctness, but
> liability. The hypothetical is always that someone included copyrighted
> text or something without permission and now the project can't rely on the
> argument that it made a good-faith effort to review what it released on the
> site. Someone becomes personally liable.
>
> These are pretty technical reasons though. More practically, what's the
> hurry to release if docs aren't done (_if_ they're not done)? It's being
> presented as normal practice, but seems quite exceptional.
>
>
>
>> I personally don't think the QA umbrella JIRAs are particularly
>> effective, but I also wouldn't ban their use if others think they are.
>> However, I do think that real QA needs an RC to test, so I think it is fine
>> that there is still outstanding QA to be done when an RC is cut.  For
>> example, I plan to run a bunch of streaming workloads on RC4 and will vote
>> accordingly.
>>
>
> QA on RCs is great (see above). The problem is, I can't distinguish
> between a JIRA that means "we must test in general", which sounds like
> something you too would ignore, and one that means "there is specific
> functionality we have to check before a release that we haven't looked at
> yet", which is a committer waving a flag that they implicitly do not want a
> release until resolved. I wouldn't +1 a release that had a Blocker software
> defect one of us reported.
>
> I know I'm harping on this, but this is the one mechanism we do use
> consistently (Blocker JIRAs) to clearly 

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
The website updates for ML QA (SPARK-20507) are not *actually* critical, as
the project website can certainly be updated separately from the source code
user guide and is not part of the release to be voted on. In future, that
particular work item in the QA process could be marked down in priority; it
is definitely not a release blocker.

In any event I just resolved SPARK-20507, as I don't believe any website
updates are required for this release anyway. That fully resolves the ML QA
umbrella (SPARK-20499).


On Tue, 6 Jun 2017 at 10:16 Sean Owen  wrote:

> On Tue, Jun 6, 2017 at 1:06 AM Michael Armbrust 
> wrote:
>
>> Regarding the readiness of this and previous RCs.  I did cut RC1 & RC2
>> knowing that they were unlikely to pass.  That said, I still think these
>> early RCs are valuable. I know several users that wanted to test new
>> features in 2.2 that have used them.  Now, if we would prefer to call them
>> preview or RC0 or something I'd be okay with that as well.
>>
>
> They are valuable, I only suggest it's better to note explicitly when
> there are blockers or must-do tasks that will fail the RC. It makes a big
> difference to whether one would like to +1.
>
> I meant more than just calling them something different. An early RC could
> be voted as a released 'preview' artifact, at the start of the notional QA
> period, with a lower bar to passing, and releasable with known issues. This
> encourages more testing. It also resolves the controversy about whether
> it's OK to include an RC in a product (separate thread).
>
>
> Regarding doc updates, I don't think it is a requirement that they be
>> voted on as part of the release.  Even if they are something version
>> specific.  I think we have regularly updated the website with documentation
>> that was merged after the release.
>>
>
> They're part of the source release too, as markdown, and should be voted
> on. I've never understood otherwise. Have we actually released docs and
> then later changed them, so that they don't match the release? I don't
> recall that, but I do recall updating the non-version-specific website.
>
> Aside from the oddity of having docs generated from x.y source not match
> docs published for x.y, you want the same protections for doc source that
> the project distributes as anything else. It's not just correctness, but
> liability. The hypothetical is always that someone included copyrighted
> text or something without permission and now the project can't rely on the
> argument that it made a good-faith effort to review what it released on the
> site. Someone becomes personally liable.
>
> These are pretty technical reasons though. More practically, what's the
> hurry to release if docs aren't done (_if_ they're not done)? It's being
> presented as normal practice, but seems quite exceptional.
>
>
>
>> I personally don't think the QA umbrella JIRAs are particularly
>> effective, but I also wouldn't ban their use if others think they are.
>> However, I do think that real QA needs an RC to test, so I think it is fine
>> that there is still outstanding QA to be done when an RC is cut.  For
>> example, I plan to run a bunch of streaming workloads on RC4 and will vote
>> accordingly.
>>
>
> QA on RCs is great (see above). The problem is, I can't distinguish
> between a JIRA that means "we must test in general", which sounds like
> something you too would ignore, and one that means "there is specific
> functionality we have to check before a release that we haven't looked at
> yet", which is a committer waving a flag that they implicitly do not want a
> release until resolved. I wouldn't +1 a release that had a Blocker software
> defect one of us reported.
>
> I know I'm harping on this, but this is the one mechanism we do use
> consistently (Blocker JIRAs) to clearly communicate about issues vital to a
> go / no-go release decision, and I think this interferes. The rest of JIRA
> noise doesn't matter much. You can see we're already resorting to secondary
> communications as a result ("anyone have any issues that need to be fixed
> before I cut another RC?" emails) because this is kind of ignored, and
> think we're swapping out a decent mechanism for worse one.
>
> I suspect, as you do, that there's no to-do here in which case they should
> be resolved and we're still on track for release. I'd wait on +1 until then.
>
>


Re: RDD MLLib Deprecation Question

2017-05-30 Thread Nick Pentreath
The short answer is those distributed linalg parts will not go away.

In the medium term, it's much less likely that the distributed matrix
classes will be ported over to DataFrames (though the ideal would be to
have DataFrame-backed distributed matrix classes) - given the time and
effort it's taken just to port the various ML models and feature
transformers over to ML.

The current distributed matrices use the old mllib linear algebra primitives
for their backing data structures and ops, so those will have to be ported at
some point to the ml package vectors & matrices, though I would expect
overall functionality to remain the same initially.

There is https://issues.apache.org/jira/browse/SPARK-15882 that discusses
some of the ideas. The decision would still need to be made on the
higher-level API (whether it remains the same is current, or changes to be
DF-based, and/or changed in other ways, etc)
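
(For reference, a minimal sketch of the existing RDD-based distributed matrix
API referred to above, assuming a SparkContext named sc as in the shell:)

import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Blocks are ((blockRowIndex, blockColIndex), local mllib Matrix) pairs.
val blocks = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))),
  ((1, 1), Matrices.dense(2, 2, Array(2.0, 0.0, 0.0, 2.0)))))
val mat = new BlockMatrix(blocks, 2, 2)  // 2x2 blocks => a 4x4 matrix

// Distributed ops such as multiply remain available; only the backing linalg
// types would eventually move from mllib to ml.
val product = mat.multiply(mat)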

On Tue, 30 May 2017 at 15:33 John Compitello 
wrote:

> Hey all,
>
> I see on the MLLib website that there are plans to deprecate the RDD based
> API for MLLib once the new ML API reaches feature parity with RDD based
> one. Are there currently plans to reimplement all the distributed linear
> algebra / matrices operations as part of this new API, or are these things
> just going away? Like, will there still be a BlockMatrix class for
> distributed multiplies?
>
> Best,
>
> John
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-19 Thread Nick Pentreath
All the outstanding ML QA doc and user guide items are done for 2.2 so from
that side we should be good to cut another RC :)

On Thu, 18 May 2017 at 00:18 Russell Spitzer 
wrote:

> Seeing an issue with the DataScanExec and some of our integration tests
> for the SCC. Running dataframe read and writes from the shell seems fine
> but the Redaction code seems to get a "None" when doing
> SparkSession.getActiveSession.get in our integration tests. I'm not sure
> why but i'll dig into this later if I get a chance.
>
> Example Failed Test
>
> https://github.com/datastax/spark-cassandra-connector/blob/v2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/sql/CassandraSQLSpec.scala#L311
>
> ```[info]   org.apache.spark.SparkException: Job aborted due to stage
> failure: Task serialization failed: java.util.NoSuchElementException:
> None.get
> [info] java.util.NoSuchElementException: None.get
> [info] at scala.None$.get(Option.scala:347)
> [info] at scala.None$.get(Option.scala:345)
> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org
> $apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
> [info] at
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
> [info] at
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
> ```
>
> Again this only seems to repo in our IT suite so i'm not sure if this is a
> real issue.
>
>
> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley 
> wrote:
>
>> All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.  Thanks
>> everyone who helped out on those!
>>
>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they are
>> essentially all for documentation.
>>
>> Joseph
>>
>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin 
>> wrote:
>>
>>> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
>>> fixed, since the change that caused it is in branch-2.2. Probably a
>>> good idea to raise it to blocker at this point.
>>>
>>> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
>>>  wrote:
>>> > I'm going to -1 given the outstanding issues and lack of +1s.  I'll
>>> create
>>> > another RC once ML has had time to take care of the more critical
>>> problems.
>>> > In the meantime please keep testing this release!
>>> >
>>> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki 
>>> > wrote:
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
>>> for
>>> >> core have passed.
>>> >>
>>> >> $ java -version
>>> >> openjdk version "1.8.0_111"
>>> >> OpenJDK Runtime Environment (build
>>> >> 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
>>> >> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>>> >> $ build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7
>>> >> package install
>>> >> $ build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl
>>> core
>>> >> ...
>>> >> Run completed in 15 minutes, 12 seconds.
>>> >> Total number of tests run: 1940
>>> >> Suites: completed 206, aborted 0
>>> >> Tests: succeeded 1940, failed 0, canceled 4, ignored 8, pending 0
>>> >> All tests passed.
>>> >> [INFO]
>>> >>
>>> 
>>> >> [INFO] BUILD SUCCESS
>>> >> [INFO]
>>> >>
>>> 
>>> >> [INFO] Total time: 16:51 min
>>> >> [INFO] Finished at: 2017-05-09T17:51:04+09:00
>>> >> [INFO] Final Memory: 53M/514M
>>> >> [INFO]
>>> >>
>>> 
>>> >> [WARNING] The requested profile "hive" could not be activated because
>>> it
>>> >> does not exist.
>>> >>
>>> >>
>>> >> Kazuaki Ishizaki,
>>> >>
>>> >>
>>> >>
>>> >> From:Michael Armbrust 
>>> >> To:"dev@spark.apache.org" 
>>> >> Date:2017/05/05 02:08
>>> >> Subject:[VOTE] Apache Spark 2.2.0 (RC2)
>>> >> 
>>> >>
>>> >>
>>> >>
>>> >> Please vote on releasing the following candidate as Apache Spark
>>> version
>>> >> 2.2.0. The vote is open until Tues, May 9th, 2017 at 12:00 PST and
>>> passes if
>>> >> a majority of at least 3 +1 PMC votes are cast.
>>> >>
>>> >> [ ] +1 Release this package as Apache Spark 2.2.0
>>> >> [ ] -1 Do not release this package because ...
>>> >>
>>> >>
>>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>>> >>
>>> >> The tag to be voted on is v2.2.0-rc2
>>> >> (1d4017b44d5e6ad156abeaae6371747f111dd1f9)
>>> >>
>>> >> List of JIRA tickets resolved can be found with this filter.
>>> >>
>>> >> The release files, including signatures, digests, etc. can be found
>>> at:
>>> >> 

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-02 Thread Nick Pentreath
I won't +1 just given that it seems certain there will be another RC and
there are the outstanding ML QA blocker issues.

But a clean build and the JVM and Python test suites LGTM on CentOS Linux
7.2.1511, OpenJDK 1.8.0_111

On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft 
wrote:

> Hi Ryan,
>
> IMO, the problem is that the Spark Avro version conflicts with the Parquet
> Avro version. As discussed upthread, I don’t think there’s a way to
> *reliably *make sure that Avro 1.8 is on the classpath first while using
> spark-submit. Relocating avro in our project wouldn’t solve the problem,
> because the MethodNotFoundError is thrown from the internals of the
> ParquetAvroOutputFormat, not from code in our project.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466 <(202)%20340-0466>
>
> On May 1, 2017, at 12:33 PM, Ryan Blue  wrote:
>
> Michael, I think that the problem is with your classpath.
>
> Spark has a dependency to 1.7.7, which can't be changed. Your project is
> what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
> dependency on Avro 1.8. It is understandably annoying that using the same
> version of Parquet for your parquet-avro dependency is what causes your
> project to depend on Avro 1.8, but Spark's dependencies aren't a problem
> because its Parquet dependency doesn't bring in Avro.
>
> There are a few ways around this:
> 1. Make sure Avro 1.8 is found in the classpath first
> 2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared)
> 3. Use parquet-avro 1.8.1 in your project, which I think should work with
> 1.8.2 and avoid the Avro change
>
> The work-around in Spark is for tests, which do use parquet-avro. We can
> look at a Parquet 1.8.3 that avoids this issue, but I think this is
> reasonable for the 2.2.0 release.
>
> rb
>
> On Mon, May 1, 2017 at 12:08 PM, Michael Heuer  wrote:
>
>> Please excuse me if I'm misunderstanding -- the problem is not with our
>> library or our classpath.
>>
>> There is a conflict within Spark itself, in that Parquet 1.8.2 expects to
>> find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
>> already has to work around this for unit tests to pass.
>>
>>
>>
>> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue  wrote:
>>
>>> Thanks for the extra context, Frank. I agree that it sounds like your
>>> problem comes from the conflict between your Jars and what comes with
>>> Spark. Its the same concern that makes everyone shudder when anything has a
>>> public dependency on Jackson. :)
>>>
>>> What we usually do to get around situations like this is to relocate the
>>> problem library inside the shaded Jar. That way, Spark uses its version of
>>> Avro and your classes use a different version of Avro. This works if you
>>> don't need to share classes between the two. Would that work for your
>>> situation?
>>>
>>> rb
>>>
>>> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers 
>>> wrote:
>>>
 sounds like you are running into the fact that you cannot really put
 your classes before spark's on classpath? spark's switches to support this
 never really worked for me either.

 inability to control the classpath + inconsistent jars => trouble ?

 On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
 fnoth...@berkeley.edu> wrote:

> Hi Ryan,
>
> We do set Avro to 1.8 in our downstream project. We also set Spark as
> a provided dependency, and build an überjar. We run via spark-submit, 
> which
> builds the classpath with our überjar and all of the Spark deps. This 
> leads
> to avro 1.7.1 getting picked off of the classpath at runtime, which causes
> the no such method exception to occur.
>
> Regards,
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466 <(202)%20340-0466>
>
> On May 1, 2017, at 11:31 AM, Ryan Blue  wrote:
>
> Frank,
>
> The issue you're running into is caused by using parquet-avro with
> Avro 1.7. Can't your downstream project set the Avro dependency to 1.8?
> Spark can't update Avro because it is a breaking change that would force
> users to rebuilt specific Avro classes in some cases. But you should be
> free to use Avro 1.8 to avoid the problem.
>
> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
> fnoth...@berkeley.edu> wrote:
>
>> Hi Ryan et al,
>>
>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>> dependency. My colleague Michael (who posted earlier on this thread)
>> documented this in Spark-19697
>> 
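
(As an illustration of the shading approach discussed above, a minimal sketch
using sbt-assembly - this assumes an sbt build and that no Avro classes need
to be shared with Spark; the relocated package name is arbitrary:)

// In build.sbt: relocate the application's Avro classes inside the assembled
// jar so they cannot clash with the Avro 1.7.7 that Spark puts on the runtime
// classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.avro.**" -> "shadedavro.@1").inAll
)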

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Nick Pentreath
As for SPARK-19759, I don't think that needs to be targeted for 2.1.1, so we
don't need to worry about it.

On Tue, 21 Mar 2017 at 13:49 Holden Karau  wrote:

> I agree with Michael, I think we've got some outstanding issues but none
> of them seem like regression from 2.1 so we should be good to start the RC
> process.
>
> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust 
> wrote:
>
> Please speak up if I'm wrong, but none of these seem like critical
> regressions from 2.1.  As such I'll start the RC process later today.
>
> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau 
> wrote:
>
> I'm not super sure it should be a blocker for 2.1.1 -- is it a regression?
> Maybe we can get TDs input on it?
>
> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:
>
> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
> blocker
>
> Best,
>
> Nan
>
> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
> wrote:
>
> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> Hi Spark Developers!
>
> As we start working on the Spark 2.1.1 release I've been looking at our
> outstanding issues still targeted for it. I've tried to break it down by
> component so that people in charge of each component can take a quick look
> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
> the overall list is pretty short (only 9 items - 5 if we only look at
> explicitly tagged) :)
>
> If your working on something for Spark 2.1.1 and it doesn't show up in
> this list please speak up now :) We have a lot of issues (including "in
> progress") that are listed as impacting 2.1.0, but they aren't targeted for
> 2.1.1 - if there is something you are working in their which should be
> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>
> The query string I used for looking at the 2.1.1 open issues is:
>
> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
> OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
> ORDER BY priority DESC
>
> None of the open issues appear to be a regression from 2.1.0, but those
> seem more likely to show up during the RC process (thanks in advance to
> everyone testing their workloads :)) & generally none of them seem to be
>
> (Note: the cfs are for Target Version/s field)
>
> Critical Issues:
>  SQL:
>   SPARK-19690  - Join
> a streaming DataFrame with a batch DataFrame may not work - PR
> https://github.com/apache/spark/pull/17052 (review in progress by
> zsxwing, currently failing Jenkins)*
>
> Major Issues:
>  SQL:
>   SPARK-19035  - rand()
> function in case when cause failed - no outstanding PR (consensus on JIRA
> seems to be leaning towards it being a real issue but not necessarily
> everyone agrees just yet - maybe we should slip this?)*
>  Deploy:
>   SPARK-19522  - 
> --executor-memory
> flag doesn't work in local-cluster mode -
> https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
> but PR currently stalled waiting on response) *
>  Core:
>   SPARK-20025  - Driver
> fail over will not work, if SPARK_LOCAL* env is set. -
> https://github.com/apache/spark/pull/17357 (waiting on review) *
>  PySpark:
>  SPARK-19955  - Update
> run-tests to support conda [ Part of Dropping 2.6 support -- which we
> shouldn't do in a minor release -- but also fixes pip installability tests
> to run in Jenkins ]-  PR failing Jenkins (I need to poke this some more,
> but seems like 2.7 support works but some other issues. Maybe slip to 2.2?)
>
> Minor issues:
>  Tests:
>   SPARK-19612  - Tests
> failing with timeout - No PR per-se but it seems unrelated to the 2.1.1
> release. It's not targetted for 2.1.1 but listed as affecting 2.1.1 - I'd
> consider explicitly targeting this for 2.2?
>  PySpark:
>   SPARK-19570  - Allow
> to disable hive in pyspark shell -
> https://github.com/apache/spark/pull/16906 PR exists but its difficult to
> add automated tests for this (although if SPARK-19955
>  gets in would make
> testing this easier) - no reviewers yet. Possible 

Re: Should we consider a Spark 2.1.1 release?

2017-03-16 Thread Nick Pentreath
Spark 1.5.1 had 87 issues with fix version 1.5.1, one month after 1.5.0.

Spark 1.6.1 had 123 issues, two months after 1.6.0.

2.0.1 was larger (317 issues), at 3 months after 2.0.0 - which makes sense
given how large a release 2.0.0 was.

We are at 185 for 2.1.1, 3 months after 2.1.0 (and not released yet, so it
could slip further) - so not totally unusual, as the release interval has
certainly increased, but in fairness probably a bit later than usual. I'd
say it definitely makes sense to cut the RC!



On Thu, 16 Mar 2017 at 02:06 Michael Armbrust 
wrote:

> Hey Holden,
>
> Thanks for bringing this up!  I think we usually cut patch releases when
> there are enough fixes to justify it.  Sometimes just a few weeks after the
> release.  I guess if we are at 3 months Spark 2.1.0 was a pretty good
> release :)
>
> That said, it is probably time. I was about to start thinking about 2.2 as
> well (we are a little past the posted code-freeze deadline), so I'm happy
> to push the buttons etc (this is a very good description
>  if you are curious). I
> would love help watching JIRA, posting the burn down on issues and
> shepherding in any critical patches.  Feel free to ping me off-line if you
> like to coordinate.
>
> Unless there are any objections, how about we aim for an RC of 2.1.1 on
> Monday and I'll also plan to cut branch-2.2 then?  (I'll send a separate
> email on this as well).
>
> Michael
>
> On Mon, Mar 13, 2017 at 1:40 PM, Holden Karau 
> wrote:
>
> I'd be happy to do the work of coordinating a 2.1.1 release if that's a
> thing a committer can do (I think the release coordinator for the most
> recent Arrow release was a committer and the final publish step took a PMC
> member to upload but other than that I don't remember any issues).
>
> On Mon, Mar 13, 2017 at 1:05 PM Sean Owen  wrote:
>
> It seems reasonable to me, in that other x.y.1 releases have followed ~2
> months after the x.y.0 release and it's been about 3 months since 2.1.0.
>
> Related: creating releases is tough work, so I feel kind of bad voting for
> someone else to do that much work. Would it make sense to deputize another
> release manager to help get out just the maintenance releases? this may in
> turn mean maintenance branches last longer. Experienced hands can continue
> to manage new minor and major releases as they require more coordination.
>
> I know most of the release process is written down; I know it's also still
> going to be work to make it 100% documented. Eventually it'll be necessary
> to make sure it's entirely codified anyway.
>
> Not pushing for it myself, just noting I had heard this brought up in side
> conversations before.
>
>
> On Mon, Mar 13, 2017 at 7:07 PM Holden Karau  wrote:
>
> Hi Spark Devs,
>
> Spark 2.1 has been out since end of December
> 
> and we've got quite a few fixes merged for 2.1.1
> 
> .
>
> On the Python side one of the things I'd like to see us get out into a
> patch release is a packaging fix (now merged) before we upload to PyPI &
> Conda, and we also have the normal batch of fixes like toLocalIterator for
> large DataFrames in PySpark.
>
> I've chatted with Felix & Shivaram who seem to think the R side is looking
> close to in good shape for a 2.1.1 release to submit to CRAN (if I've
> miss-spoken my apologies). The two outstanding issues that are being
> tracked for R are SPARK-18817, SPARK-19237.
>
> Looking at the other components quickly it seems like structured streaming
> could also benefit from a patch release.
>
> What do others think - are there any issues people are actively targeting
> for 2.1.1? Is this too early to be considering a patch release?
>
> Cheers,
>
> Holden
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
>
>


Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-03-04 Thread Nick Pentreath
Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from
SPARK-19498 specifically to discuss opening up sharedParams traits.


On Fri, 3 Mar 2017 at 23:17 Shouheng Yi 
wrote:

> Hi Spark dev list,
>
>
>
> Thank you guys so much for all your inputs. We really appreciated those
> suggestions. After some discussions in the team, we decided to stay under
> apache’s namespace for now, and attach some comments to explain what we did
> and why we did this.
>
>
>
> As the Spark dev list kindly pointed out, this is an existing issue that
> was documented in the JIRA ticket [Spark-19498] [0]. We can follow the JIRA
> ticket to see if there are any new suggested practices that should be
> adopted in the future and make corresponding fixes.
>
>
>
> Best,
>
> Shouheng
>
>
>
> [0] https://issues.apache.org/jira/browse/SPARK-19498
>
>
>
> *From:* Tim Hunter [mailto:timhun...@databricks.com
> ]
> *Sent:* Friday, February 24, 2017 9:08 AM
> *To:* Joseph Bradley 
> *Cc:* Steve Loughran ; Shouheng Yi <
> sho...@microsoft.com.invalid>; Apache Spark Dev ;
> Markus Weimer ; Rogan Carr ;
> Pei Jiang ; Miruna Oprescu 
> *Subject:* Re: [Spark Namespace]: Expanding Spark ML under Different
> Namespace?
>
>
>
> Regarding logging, Graphframes makes a simple wrapper this way:
>
>
>
>
> https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/Logging.scala
> 
>
>
>
> Regarding the UDTs, they have been hidden to be reworked for Datasets, the
> reasons being detailed here [1]. Can you describe your use case in more
> details? You may be better off copy/pasting the UDT code outside of Spark,
> depending on your use case.
>
>
>
> [1] https://issues.apache.org/jira/browse/SPARK-14155
> 
>
>
>
> On Thu, Feb 23, 2017 at 3:42 PM, Joseph Bradley 
> wrote:
>
> +1 for Nick's comment about discussing APIs which need to be made public
> in https://issues.apache.org/jira/browse/SPARK-19498
> 
> !
>
>
>
> On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran 
> wrote:
>
>
>
> On 22 Feb 2017, at 20:51, Shouheng Yi 
> wrote:
>
>
>
> Hi Spark developers,
>
>
>
> Currently my team at Microsoft is extending Spark’s machine learning
> functionalities to include new learners and transformers. We would like
> users to use these within spark pipelines so that they can mix and match
> with existing Spark learners/transformers, and overall have a native spark
> experience. We cannot accomplish this using a non-“org.apache” namespace
> with the current implementation, and we don’t want to release code inside
> the apache namespace because it’s confusing and there could be naming
> rights issues.
>
>
>
> This isn't actually the ASF has a strong stance against, more left to
> projects themselves. After all: the source is licensed by the ASF, and the
> license doesn't say you can't.
>
>
>
> Indeed, there's a bit of org.apache.hive in the Spark codebase where the
> hive team kept stuff package private. Though that's really a sign that
> things could be improved there.
>
>
>
> Where is problematic is that stack traces end up blaming the wrong group;
> nobody likes getting a bug report which doesn't actually exist in your
> codebase., not least because you have to waste time to even work it out.
>
>
>
> You also have to expect absolutely no stability guarantees, so you'd
> better set your nightly build to work against trunk
>
>
>
> Apache Bahir does put some stuff into org.apache.spark.stream, but they've
> sort of inherited that right.when they picked up the code from spark. new
> stuff is going into org.apache.bahir
>
>
>
>
>
> We need to extend several classes from spark which happen to have
> “private[spark].” For example, one of our class extends VectorUDT[0] which
> has private[spark] class 

Re: Feedback on MLlib roadmap process proposal

2017-02-24 Thread Nick Pentreath
FYI I've started going through a few of the top Watched JIRAs and tried to
identify those that are obviously stale and can probably be closed, to try
to clean things up a bit.

On Thu, 23 Feb 2017 at 21:38 Tim Hunter <timhun...@databricks.com> wrote:

> As Sean wrote very nicely above, the changes made to Spark are decided in
> an organic fashion based on the interests and motivations of the committers
> and contributors. The case of deep learning is a good example. There is a
> lot of interest, and the core algorithms could be implemented without too
> much problem in a few thousands of lines of scala code. However, the
> performance of such a simple implementation would be one to two order of
> magnitude slower than what would get from the popular frameworks out there.
>
> At this point, there are probably more man-hours invested in TensorFlow
> (as an example) than in MLlib, so I think we need to be realistic about
> what we can expect to achieve inside Spark. Unlike BLAS for linear algebra,
> there is no agreed-up interface for deep learning, and each of the XOnSpark
> flavors explores a slightly different design. It will be interesting to see
> what works well in practice. In the meantime, though, there are plenty of
> things that we could do to help developers of other libraries to have a
> great experience with Spark. Matei alluded to that in his Spark Summit
> keynote when he mentioned better integration with low-level libraries.
>
> Tim
>
>
> On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
> Sorry for being late to the discussion. I think Joseph, Sean and others
> have covered the issues well.
>
> Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
> As for the actual critical roadmap items mentioned on SPARK-18813, I think
> it makes sense and will comment a bit further on that JIRA.
>
> I would like to encourage votes & watching for issues to give a sense of
> what the community wants (I guess Vote is more explicit yet passive, while
> actually Watching an issue is more informative as it may indicate a real
> use case dependent on the issue?!).
>
> I think if used well this is valuable information for contributors. Of
> course not everything on that list can get done. But if I look through the
> top votes or watch list, while not all of those are likely to go in, a
> great many of the issues are fairly non-contentious in terms of being good
> additions to the project.
>
> Things like these are good examples IMO (I just sample a few of them, not
> exhaustive):
> - sample weights for RF / DT
> - multi-model and/or parallel model selection
> - make sharedParams public?
> - multi-column support for various transformers
> - incremental model training
> - tree algorithm enhancements
>
> Now, whether these can be prioritised in terms of bandwidth available to
> reviewers and committers is a totally different thing. But as Sean mentions
> there is some process there for trying to find the balance of the issue
> being a "good thing to add", a shepherd with bandwidth & interest in the
> issue to review, and the maintenance burden imposed.
>
> Let's take Deep Learning / NN for example. Here's a good example of
> something that has a lot of votes/watchers and as Sean mentions it is
> something that "everyone wants someone else to implement". In this case,
> much of the interest may in fact be "stale" - 2 years ago it would have
> been very interesting to have a strong DL impl in Spark. Now, because there
> are a plethora of very good DL libraries out there, how many of those Votes
> would be "deleted"? Granted few are well integrated with Spark but that can
> and is changing (DL4J, BigDL, the "XonSpark" flavours etc).
>
> So this is something that I dare say will not be in Spark any time in the
> foreseeable future or perhaps ever given the current status. Perhaps it's
> worth seriously thinking about just closing these kind of issues?
>
>
>
> On Fri, 27 Jan 2017 at 05:53 Joseph Bradley <jos...@databricks.com> wrote:
>
> Sean has given a great explanation.  A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap.  If there are ones large
> enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we

Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Nick Pentreath
Sorry for being late to the discussion. I think Joseph, Sean and others
have covered the issues well.

Overall I like the proposed cleaned up roadmap & process (thanks Joseph!).
As for the actual critical roadmap items mentioned on SPARK-18813, I think
it makes sense and will comment a bit further on that JIRA.

I would like to encourage votes & watching for issues to give a sense of
what the community wants (I guess Vote is more explicit yet passive, while
actually Watching an issue is more informative as it may indicate a real
use case dependent on the issue?!).

I think if used well this is valuable information for contributors. Of
course not everything on that list can get done. But if I look through the
top votes or watch list, while not all of those are likely to go in, a
great many of the issues are fairly non-contentious in terms of being good
additions to the project.

Things like these are good examples IMO (I just sample a few of them, not
exhaustive):
- sample weights for RF / DT
- multi-model and/or parallel model selection
- make sharedParams public?
- multi-column support for various transformers
- incremental model training
- tree algorithm enhancements

Now, whether these can be prioritised in terms of bandwidth available to
reviewers and committers is a totally different thing. But as Sean mentions
there is some process there for trying to find the balance of the issue
being a "good thing to add", a shepherd with bandwidth & interest in the
issue to review, and the maintenance burden imposed.

Let's take Deep Learning / NN for example. Here's a good example of
something that has a lot of votes/watchers and as Sean mentions it is
something that "everyone wants someone else to implement". In this case,
much of the interest may in fact be "stale" - 2 years ago it would have
been very interesting to have a strong DL impl in Spark. Now, because there
are a plethora of very good DL libraries out there, how many of those Votes
would be "deleted"? Granted few are well integrated with Spark but that can
and is changing (DL4J, BigDL, the "XonSpark" flavours etc).

So this is something that I dare say will not be in Spark any time in the
foreseeable future or perhaps ever given the current status. Perhaps it's
worth seriously thinking about just closing these kind of issues?



On Fri, 27 Jan 2017 at 05:53 Joseph Bradley  wrote:

> Sean has given a great explanation.  A few more comments:
>
> Roadmap: I have been creating roadmap JIRAs, but the goal really is to
> have all committers working on MLlib help to set that roadmap, based on
> either their knowledge of current maintenance/internal needs of the project
> or the feedback given from the rest of the community.
> @Committers - I see people actively shepherding PRs for MLlib, but I don't
> see many major initiatives linked to the roadmap.  If there are ones large
> enough to merit adding to the roadmap, please do.
>
> In general, there are many process improvements we could make.  A few in
> my mind are:
> * Visibility: Let the community know what committers are focusing on.
> This was the primary purpose of the "MLlib roadmap proposal."
> * Community initiatives: This is currently very organic.  Some of the
> organic process could be improved, such as encouraging Votes/Watchers
> (though I agree with Sean about these being one-sided metrics).  Cody's SIP
> work is a great step towards adding more clarity and structure for major
> initiatives.
> * JIRA hygiene: Always a challenge, and always requires some manual
> prodding.  But it's great to push for efforts on this.
>
>
> On Wed, Jan 25, 2017 at 3:59 AM, Sean Owen  wrote:
>
> On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach  wrote:
>
> My confusion was that the ML 2.2 roadmap critical features (
> https://issues.apache.org/jira/browse/SPARK-18813) did not line up with
> the top ML/MLLIB JIRAs by Votes
> or
> Watchers
> 
> .
>
> Your 

Re: Implementation of RNN/LSTM in Spark

2017-02-23 Thread Nick Pentreath
The short answer is that there is none, and it is highly unlikely that there
will be one inside Spark MLlib any time in the near future.

The best bets are to look at other DL libraries that run on Spark - for the
JVM there are Deeplearning4J and BigDL (there are others, but these seem to
be the most comprehensive I have come across). There are also various
flavours of TensorFlow / Caffe on Spark. And of course the standalone libs
such as Torch, Keras, TensorFlow, MXNet, Caffe etc. - some of them have Java
or Scala APIs and some form of Spark integration out there in the community
(in varying states of development).

Integrations with Spark are a bit patchy currently but include the
"XOnSpark" flavours mentioned above and TensorFrames (again, there may be
others).

On Thu, 23 Feb 2017 at 14:23 n1kt0  wrote:

> Hi,
> can anyone tell me what the current status about RNNs in Spark is?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Implementation-of-RNN-LSTM-in-Spark-tp14866p21060.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Google Summer of Code 2017 is coming

2017-02-05 Thread Nick Pentreath
I think Sean raises valid points - that the result is highly dependent on
the particular student, project and mentor involved, and that the actual
required time investment is very significant.

Having said that, it's not all bad certainly. Scikit-learn started as a
GSoC project 10 years ago!

Actually they have a pretty good model for accepting students - typically
the student must demonstrate significant prior knowledge and ability with
the project sufficient to complete the work.

The challenge for Spark is that folks are already strapped for capacity, so
finding mentors with time will be tricky. But if there are mentors and the
right project / student fit can be found, I think it's a good idea.


On Sat, 4 Feb 2017 at 01:22 Jacek Laskowski  wrote:

> Thanks Sean. You've again been very helpful to put the right tone to
> the matters. I stand corrected and have no interest in GSoC anymore.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Feb 3, 2017 at 11:38 PM, Sean Owen  wrote:
> > I have a contrarian opinion on GSoC from experience many years ago in
> > Mahout. Of 3 students I interacted with, 2 didn't come close to
> completing
> > the work they signed up for. I think it's mostly that students are hungry
> > for the resumé line item, and don't understand the amount of work they're
> > proposing, and ultimately have little incentive to complete their
> proposal.
> > The stipend is small.
> >
> > I can appreciate the goal of GSoC but it makes more sense for projects
> that
> > don't get as much attention, and Spark gets plenty. I would not expect
> > students to be able to be net contributors to a project like Spark. The
> time
> > they consume in hand-holding will exceed the time it would take for
> someone
> > experienced to just do the work. I would caution anyone from agreeing to
> > this for Spark unless they are willing to devote 5-10 hours per week for
> the
> > summer to helping someone learn.
> >
> > My net experience with GSoC is negative, mostly on account of the
> > participants.
> >
> > On Fri, Feb 3, 2017 at 9:56 PM Holden Karau 
> wrote:
> >>
> >> As someone who did GSoC back in University I think this could be a good
> >> idea if there is enough interest from the PMC & I'd be willing the help
> >> mentor if that is a bottleneck.
> >>
> >> On Fri, Feb 3, 2017 at 12:42 PM, Jacek Laskowski 
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Is this something Spark considering? Would be nice to mark issues as
> >>> GSoC in JIRA and solicit feedback. What do you think?
> >>>
> >>> Pozdrawiam,
> >>> Jacek Laskowski
> >>> 
> >>> https://medium.com/@jaceklaskowski/
> >>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> >>> Follow me at https://twitter.com/jaceklaskowski
> >>>
> >>>
> >>>
> >>> -- Forwarded message --
> >>> From: Ulrich Stärk 
> >>> Date: Fri, Feb 3, 2017 at 8:50 PM
> >>> Subject: Google Summer of Code 2017 is coming
> >>> To: ment...@community.apache.org
> >>>
> >>>
> >>> Hello PMCs (incubator Mentors, please forward this email to your
> >>> podlings),
> >>>
> >>> Google Summer of Code [1] is a program sponsored by Google allowing
> >>> students to spend their summer
> >>> working on open source software. Students will receive stipends for
> >>> developing open source software
> >>> full-time for three months. Projects will provide mentoring and
> >>> project ideas, and in return have
> >>> the chance to get new code developed and - most importantly - to
> >>> identify and bring in new committers.
> >>>
> >>> The ASF will apply as a participating organization meaning individual
> >>> projects don't have to apply
> >>> separately.
> >>>
> >>> If you want to participate with your project we ask you to do the
> >>> following things as soon as
> >>> possible but by no later than 2017-02-09:
> >>>
> >>> 1. understand what it means to be a mentor [2].
> >>>
> >>> 2. record your project ideas.
> >>>
> >>> Just create issues in JIRA, label them with gsoc2017, and they will
> >>> show up at [3]. Please be as
> >>> specific as possible when describing your idea. Include the
> >>> programming language, the tools and
> >>> skills required, but try not to scare potential students away. They
> >>> are supposed to learn what's
> >>> required before the program starts.
> >>>
> >>> Use labels, e.g. for the programming language (java, c, c++, erlang,
> >>> python, brainfuck, ...) or
> >>> technology area (cloud, xml, web, foo, bar, ...) and record them at
> [5].
> >>>
> >>> Please use the COMDEV JIRA project for recording your ideas if your
> >>> project doesn't use JIRA (e.g.
> >>> httpd, ooo). Contact d...@community.apache.org if you need assistance.
> >>>
> >>> [4] contains some additional information (will be updated 

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Nick Pentreath
Hi Maciej

If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames *then
that seems to point to some other underlying issue as the root cause.

Even though adding checkpointing should help, we should understand why it's
different between 1.6 and 2.0?
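(As an aside, the kind of plan/lineage truncation being discussed looks roughly
like the sketch below - a hypothetical snippet only: Dataset.checkpoint requires
Spark 2.1+, and the directory path and DataFrame name are placeholders.)

// minimal sketch: materialize an intermediate DataFrame and cut its query plan /
// RDD lineage, so downstream stages don't keep re-analyzing an ever-growing plan
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
val truncated = someIntermediateDF.checkpoint()
// continue fitting / transforming the remaining stages against `truncated`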


On Thu, 2 Feb 2017 at 08:22 Liang-Chi Hsieh  wrote:

>
> Hi Maciej,
>
> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>
>
> Liang-Chi Hsieh wrote
> > Hi Maciej,
> >
> > Basically the fitting algorithm in Pipeline is an iterative operation.
> > Running iterative algorithm on Dataset would have RDD lineages and query
> > plans that grow fast. Without cache and checkpoint, it gets slower when
> > the iteration number increases.
> >
> > I think that is why a Pipeline with a long chain of stages takes much
> > longer to finish. As it is not uncommon to have a long chain of stages
> > in a Pipeline, we should improve this. I will submit a PR for this.
> > zero323 wrote
> >> Hi everyone,
> >>
> >> While experimenting with ML pipelines I experience a significant
> >> performance regression when switching from 1.6.x to 2.x.
> >>
> >> import org.apache.spark.ml.{Pipeline, PipelineStage}
> >> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
> >> VectorAssembler}
> >>
> >> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
> >> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
> >> val indexers = df.columns.tail.map(c => new StringIndexer()
> >>   .setInputCol(c)
> >>   .setOutputCol(s"${c}_indexed")
> >>   .setHandleInvalid("skip"))
> >>
> >> val encoders = indexers.map(indexer => new OneHotEncoder()
> >>   .setInputCol(indexer.getOutputCol)
> >>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
> >>   .setDropLast(true))
> >>
> >> val assembler = new
> >> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
> >> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
> >>
> >> new Pipeline().setStages(stages).fit(df).transform(df).show
> >>
> >> Task execution time is comparable and executors are most of the time
> >> idle so it looks like it is a problem with the optimizer. Is it a known
> >> issue? Are there any changes I've missed, that could lead to this
> >> behavior?
> >>
> >> --
> >> Best,
> >> Maciej
> >>
> >>
> >> -
> >> To unsubscribe e-mail:
>
> >> dev-unsubscribe@.apache
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
Yup - it's because almost all model data in Spark ML (model coefficients)
is "small" - i.e. non-distributed.

If you look at ALS you'll see there is no repartitioning, since the factor
DataFrames can be large.
On Fri, 13 Jan 2017 at 19:42, Sean Owen  wrote:

> You're referring to code that serializes models, which are quite small.
> For example a PCA model consists of a few principal component vectors. It's
> a Dataset of just one element being saved here. It's re-using the code path
> normally used to save big data sets, to output 1 file with 1 thing as
> Parquet.
>
> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim  wrote:
>
> But why is that beneficial? The data is supposedly quite large,
> distributing it across many partitions/files would seem to make sense.
>
> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen  wrote:
>
> That is usually so the result comes out in one file, not partitioned over
> n files.
>
> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim  wrote:
>
> Hi,
>
> I'm curious why it's common for data to be repartitioned to 1 partition
> when saving ml models:
>
> sqlContext.createDataFrame(Seq(data)).repartition(1
> ).write.parquet(dataPath)
>
> This shows up in most ml models I've seen (Word2Vec, PCA, LDA).
> Am I missing some benefit of repartitioning like this?
>
> Thanks,
> --
> Asher Krim
> Senior Software Engineer
>
>
>
>
> --
> Asher Krim
> Senior Software Engineer
>
>


Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely it's because HashingTF returns ml vectors while you need mllib
vectors for RowMatrix.

I'd recommend using the vector conversion utils (I think in
mllib.linalg.Vectors but I'm on mobile right now so can't recall exactly).
There are util methods for converting single vectors as well as vector
rows of a DataFrame. Check the mllib user guide for 2.0 for details.
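For the snippet below, a minimal sketch of that conversion (assuming Spark 2.0's
Vectors.fromML and the "features" column produced by the IDF stage) would be:

import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// convert the ml ("new") vectors in the "features" column to mllib ("old")
// vectors before building the RowMatrix
val rows = rescaledData.select("features").rdd.map { row =>
  OldVectors.fromML(row.getAs[MLVector]("features"))
}
val mat = new RowMatrix(rows)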
On Fri, 9 Dec 2016 at 04:42, satyajit vegesna 
wrote:

> Hi All,
>
> PFB code.
>
>
> import org.apache.spark.ml.feature.{HashingTF, IDF}
> import org.apache.spark.ml.linalg.SparseVector
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.{SparkConf, SparkContext}
>
> /**
>   * Created by satyajit on 12/7/16.
>   */
> object DIMSUMusingtf extends App {
>
>   val conf = new SparkConf()
> .setMaster("local[1]")
> .setAppName("testColsim")
>   val sc = new SparkContext(conf)
>   val spark = SparkSession
> .builder
> .appName("testColSim").getOrCreate()
>
>   import org.apache.spark.ml.feature.Tokenizer
>
>   val sentenceData = spark.createDataFrame(Seq(
> (0, "Hi I heard about Spark"),
> (0, "I wish Java could use case classes"),
> (1, "Logistic regression models are neat")
>   )).toDF("label", "sentence")
>
>   val tokenizer = new 
> Tokenizer().setInputCol("sentence").setOutputCol("words")
>
>   val wordsData = tokenizer.transform(sentenceData)
>
>
>   val hashingTF = new HashingTF()
> .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
>
>   val featurizedData = hashingTF.transform(wordsData)
>
>
>   val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
>   val idfModel = idf.fit(featurizedData)
>   val rescaledData = idfModel.transform(featurizedData)
>   rescaledData.show()
>   rescaledData.select("features", "label").take(3).foreach(println)
>   val check = rescaledData.select("features")
>
>   val row = check.rdd.map(row => row.getAs[SparseVector]("features"))
>
>   val mat = new RowMatrix(row) //i am basically trying to use Dense.vector as 
> a direct input to
>
> rowMatrix, but i get an error that RowMatrix Cannot resolve constructor
>
>   row.foreach(println)
> }
>
> Any help would be appreciated.
>
> Regards,
> Satyajit.
>
>
>
>


Re: 2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Nick Pentreath
I went ahead and re-marked all the existing 2.1.1 fix version JIRAs (that
had gone into branch-2.1 since RC1 but before RC2) for Spark ML to 2.1.0

On Thu, 8 Dec 2016 at 09:20 Reynold Xin  wrote:

> Thanks.
>


Re: unhelpful exception thrown on predict() when ALS trained model doesn't contain user or product?

2016-12-06 Thread Nick Pentreath
Indeed, it's being tracked here:
https://issues.apache.org/jira/browse/SPARK-18230, though no PR has been
opened yet.
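In the meantime, a defensive wrapper along these lines (a hypothetical helper,
not part of MLlib) avoids the opaque exception by returning None for unknown ids:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// returns None when either the user or the product is absent from the model's factors
def safePredict(model: MatrixFactorizationModel, user: Int, product: Int): Option[Double] = {
  val userVec = model.userFeatures.lookup(user).headOption
  val productVec = model.productFeatures.lookup(product).headOption
  for (u <- userVec; p <- productVec) yield u.zip(p).map { case (a, b) => a * b }.sum
}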

On Tue, 6 Dec 2016 at 13:36 chris snow  wrote:

> I'm using the MatrixFactorizationModel.predict() method and encountered
> the following exception:
>
> Name: java.util.NoSuchElementException
> Message: next on empty iterator
> StackTrace: scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
> scala.collection.IterableLike$class.head(IterableLike.scala:91)
>
> scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
>
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
> scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
>
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:81)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:74)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:79)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:81)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:83)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:85)
>
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:87)
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:89)
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:91)
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:93)
> $line78.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:95)
> $line78.$read$$iwC$$iwC$$iwC$$iwC.(:97)
> $line78.$read$$iwC$$iwC$$iwC.(:99)
> $line78.$read$$iwC$$iwC.(:101)
> $line78.$read$$iwC.(:103)
> $line78.$read.(:105)
> $line78.$read$.(:109)
> $line78.$read$.()
> $line78.$eval$.(:7)
> $line78.$eval$.()
> $line78.$eval.$print()
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
> java.lang.reflect.Method.invoke(Method.java:507)
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>
> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1$$anonfun$apply$3.apply(ScalaInterpreter.scala:296)
>
> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1$$anonfun$apply$3.apply(ScalaInterpreter.scala:291)
> com.ibm.spark.global.StreamState$.withStreams(StreamState.scala:80)
>
> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1.apply(ScalaInterpreter.scala:290)
>
> com.ibm.spark.interpreter.ScalaInterpreter$$anonfun$interpretAddTask$1.apply(ScalaInterpreter.scala:290)
>
> com.ibm.spark.utils.TaskManager$$anonfun$add$2$$anon$1.run(TaskManager.scala:123)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> java.lang.Thread.run(Thread.java:785)
>
> This took some debugging to figure out why I received the Exception, but
> when looking at the predict() implementation, it seems to assume that there
> will always be features found for the provided user and product ids:
>
>
>   /** Predict the rating of one user for one product. */
>   @Since("0.8.0")
>   def predict(user: Int, product: Int): Double = {
> val userVector = userFeatures.lookup(user).head
> val productVector = productFeatures.lookup(product).head
> blas.ddot(rank, userVector, 1, productVector, 1)
>   }
>
> It would be helpful if a more useful exception was raised, e.g.
>
> MissingUserFeatureException : "User ID ${user} not found in model"
> MissingProductFeatureException : "Product ID ${product} not found in model"
>
> WDYT?
>
>
>
>


Re: Why don't we imp some adaptive learning rate methods, such as adadelat, adam?

2016-11-30 Thread Nick Pentreath
check out https://github.com/VinceShieh/Spark-AdaOptimizer

On Wed, 30 Nov 2016 at 10:52 WangJianfei 
wrote:

> Hi devs:
> Normally, adaptive learning rate methods can converge faster
> than standard SGD, so why don't we implement them?
> see the link for more details
> http://sebastianruder.com/optimizing-gradient-descent/index.html#adadelta
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-don-t-we-imp-some-adaptive-learning-rate-methods-such-as-adadelat-adam-tp20057.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it
would also be super useful :)
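In the meantime, a minimal sketch of a custom Transformer (a toy, hypothetical
class modeled on ml.feature.Tokenizer via UnaryTransformer) looks roughly like this:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// toy example: upper-cases a string column
class UpperCaser(override val uid: String)
  extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  // the per-row transformation applied to the input column
  override protected def createTransformFunc: String => String = _.toUpperCase

  // type of the generated output column
  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}

// usage: new UpperCaser().setInputCol("text").setOutputCol("textUpper").transform(df)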

On Fri, 18 Nov 2016 at 05:29 Holden Karau  wrote:

> I've been working on a blog post around this and hope to have it published
> early next month 
>
> On Nov 17, 2016 10:16 PM, "Joseph Bradley"  wrote:
>
> Hi Georg,
>
> It's true we need better documentation for this.  I'd recommend checking
> out simple algorithms within Spark for examples:
> ml.feature.Tokenizer
> ml.regression.IsotonicRegression
>
> You should not need to put your library in Spark's namespace.  The shared
> Params in SPARK-7146 are not necessary to create a custom algorithm; they
> are just niceties.
>
> Though there aren't great docs yet, you should be able to follow existing
> examples.  And I'd like to add more docs in the future!
>
> Good luck,
> Joseph
>
> On Wed, Nov 16, 2016 at 6:29 AM, Georg Heiler 
> wrote:
>
> HI,
>
> I want to develop a library with custom Estimator / Transformers for
> Spark. So far I have not found much documentation beyond
> http://stackoverflow.com/questions/37270446/how-to-roll-a-custom-estimator-in-pyspark-mllib
>
> which suggests that:
> Generally speaking, there is no documentation because as of Spark 1.6 /
> 2.0 most of the related API is not intended to be public. It should change
> in Spark 2.1.0 (see SPARK-7146
> ).
>
> Where can I already find documentation today?
> Is it true that my library would need to reside in Spark's namespace
> similar to https://github.com/collectivemedia/spark-ext to utilize all
> the handy functionality?
>
> Kind Regards,
> Georg
>
>
>
>


Re: Question about using collaborative filtering in MLlib

2016-11-03 Thread Nick Pentreath
I have a PR for it - https://github.com/apache/spark/pull/12574

Sadly I've been tied up and haven't had a chance to work further on it.

The main issue outstanding is deciding on the transform semantics as well
as performance testing.

Any comments / feedback welcome especially on transform semantics.

N


Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also)

I think a more general version of ranking metrics that allows arbitrary
relevance scores could be useful. Ranking metrics are applicable to other
settings like search or other learning-to-rank use cases, so it should be a
little more generic than pure recommender settings.

The one issue with the proposed implementation is that it is not compatible
with the existing cross-validators within a pipeline.

As I've mentioned on the linked JIRAs & PRs, one option is to create a
special set of cross-validators for recommenders, that address the issues
of (a) dataset splitting specific to recommender settings (user-based
stratified sampling, time-based etc) and (b) ranking-based evaluation.

The other option is to have the ALSModel itself capable of generating the
"ground-truth" set within the same dataframe output from "transform" (ie
predict top k) that can be fed into the cross-validator (with
RankingEvaluator) directly. That's the approach I took so far in
https://github.com/apache/spark/pull/12574.

Both options are valid and have their positives & negatives - open to
comments / suggestions.
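For reference, a rough standalone sketch (plain Scala, not the MLlib
implementation) of DCG/NDCG using the graded 2^rel - 1 gain discussed below, as
opposed to the binary gain RankingMetrics currently assumes:

// relevances: graded relevance of the items at ranks 1..n, in predicted order
def dcgAt(k: Int, relevances: Seq[Double]): Double =
  relevances.take(k).zipWithIndex.map { case (rel, i) =>
    (math.pow(2.0, rel) - 1.0) / (math.log(i + 2.0) / math.log(2.0)) // log2(rank + 1)
  }.sum

def ndcgAt(k: Int, relevances: Seq[Double]): Double = {
  val ideal = dcgAt(k, relevances.sortBy(-_))
  if (ideal == 0.0) 0.0 else dcgAt(k, relevances) / ideal
}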

On Tue, 20 Sep 2016 at 06:08 Jong Wook Kim  wrote:

> Thanks for the clarification and the relevant links. I overlooked the
> comments explicitly saying that the relevance is binary.
>
> I understand that the label is not a relevance, but I have been, and I
> think many people are using the label as relevance in the implicit feedback
> context where the user-provided exact label is not defined anyway. I think
> that's why RiVal is using the term
> "preference" for both the label for MAE and the relevance for NDCG.
>
> At the same time, I see why Spark decided to assume the relevance is
> binary, in part to conform to the class RankingMetrics's constructor. I
> think it would be nice if the upcoming DataFrame-based RankingEvaluator can
> be optionally set a "relevance column" that has non-binary relevance
> values, otherwise defaulting to either 1.0 or the label column.
>
> My extended version of RankingMetrics is here:
> https://github.com/jongwook/spark-ranking-metrics . It has a test case
> checking that the numbers are same as RiVal's.
>
> Jong Wook
>
>
>
> On 19 September 2016 at 03:13, Sean Owen  wrote:
>
>> Yes, relevance is always 1. The label is not a relevance score so
>> I don't think it's valid to use it as such.
>>
>> On Mon, Sep 19, 2016 at 4:42 AM, Jong Wook Kim  wrote:
>> > Hi,
>> >
>> > I'm trying to evaluate a recommendation model, and found that Spark and
>> > Rival give different results, and it seems that Rival's one is what
>> Kaggle
>> > defines:
>> https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>> >
>> > Am I using RankingMetrics in a wrong way, or is Spark's implementation
>> > incorrect?
>> >
>> > To my knowledge, NDCG should be dependent on the relevance (or
>> preference)
>> > values, but Spark's implementation seems not; it uses 1.0 where it
>> should be
>> > 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also
>> tried
>> > tweaking, but its method to obtain the ideal DCG also seems wrong.
>> >
>> > Any feedback from MLlib developers would be appreciated. I made a
>> > modified/extended version of RankingMetrics that produces the identical
>> > numbers to Kaggle and Rival's results, and I'm wondering if it is
>> something
>> > appropriate to be added back to MLlib.
>> >
>> > Jong Wook
>>
>
>


Re: Organizing Spark ML example packages

2016-09-12 Thread Nick Pentreath
Never actually got around to doing this - do folks still think it's
worthwhile?

On Thu, 21 Apr 2016 at 00:10 Joseph Bradley <jos...@databricks.com> wrote:

> Sounds good to me.  I'd request we be strict during this process about
> requiring *no* changes to the example itself, which will make review easier.
>
> On Tue, Apr 19, 2016 at 11:12 AM, Bryan Cutler <cutl...@gmail.com> wrote:
>
>> +1, adding some organization would make it easier for people to find a
>> specific example
>>
>> On Mon, Apr 18, 2016 at 11:52 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> This sounds good to me, and it will make ML examples more neatly.
>>>
>>> 2016-04-14 5:28 GMT-07:00 Nick Pentreath <nick.pentre...@gmail.com>:
>>>
>>>> Hey Spark devs
>>>>
>>>> I noticed that we now have a large number of examples for ML & MLlib in
>>>> the examples project - 57 for ML and 67 for MLLIB to be precise. This is
>>>> bound to get larger as we add features (though I know there are some PRs to
>>>> clean up duplicated examples).
>>>>
>>>> What do you think about organizing them into packages to match the use
>>>> case and the structure of the code base? e.g.
>>>>
>>>> org.apache.spark.examples.ml.recommendation
>>>>
>>>> org.apache.spark.examples.ml.feature
>>>>
>>>> and so on...
>>>>
>>>> Is it worth doing? The doc pages with include_example would need
>>>> updating, and the run_example script input would just need to change the
>>>> package slightly. Did I miss any potential issue?
>>>>
>>>> N
>>>>
>>>
>>>
>>
>


Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nick Pentreath
It's not that a Transformer can't output multiple columns - it's
simply that none of the current ones do. It's true that it might be a
relatively less common use case in general.

But take StringIndexer for example. It turns strings (categorical features)
into ints (0-based indexes). It could (should) accept multiple input
columns for efficiency (see
https://issues.apache.org/jira/browse/SPARK-11215). This is a case where
multiple output columns would be required.
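For now the workaround is one StringIndexer per column composed into a
Pipeline - roughly the sketch below (the DataFrame `df` and column names are
assumed for illustration):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

// one indexer per categorical column; a single multi-column StringIndexer would
// replace this boilerplate (and be more efficient)
val categoricalCols = Array("x0", "x1", "x2")
val stages: Array[PipelineStage] = categoricalCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_indexed"): PipelineStage
}
val indexed = new Pipeline().setStages(stages).fit(df).transform(df)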

N

On Tue, 23 Aug 2016 at 16:15 Nicholas Chammas 
wrote:

> If you create your own Spark 2.x ML Transformer, there are multiple
> mix-ins (is that the correct term?) that you can use to define its behavior
> which are in ml/param/shared.py.
>
> Among them are the following mix-ins:
>
>- HasInputCol
>- HasInputCols
>- HasOutputCol
>
> What’s *not* available is a HasOutputCols mix-in, and I assume that is
> intentional.
>
> Is there a design reason why Transformers should not be able to define
> multiple output columns?
>
> I’m guessing if you are an ML beginner who thinks they need a Transformer
> with multiple output columns, you’ve misunderstood something. 
>
> Nick
> ​
>


Re: Java 8

2016-08-20 Thread Nick Pentreath
Spark already supports compiling with Java 8. What refactoring are you
referring to, and where do you expect to see performance gains?

On Sat, 20 Aug 2016 at 12:41, Timur Shenkao  wrote:

> Hello, guys!
>
> Are there any plans / tickets / branches in repository on Java 8?
>
> I ask because the ML library will gain in performance. I'd like to take part
> in refactoring.
>


Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nick Pentreath
Currently there is no direct way in Spark to serve models without bringing
in all of Spark as a dependency.

For Spark ML, there is actually no way to do it independently of DataFrames
either (which for single-instance prediction makes things sub-optimal).
That is covered here: https://issues.apache.org/jira/browse/SPARK-10413

So, your options are (in Scala) things like MLeap, PredictionIO, or "roll
your own". Or you can try to export to some other format such as PMML or
PFA. Some MLlib models support PMML export, but for ML it is still missing
(see https://issues.apache.org/jira/browse/SPARK-11171).
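As a concrete example of the existing mllib-side support, models that mix in
PMMLExportable (KMeansModel is one) can be exported directly - a small sketch,
assuming a SparkContext `sc` and an illustrative output path:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 8.0)))
val model = KMeans.train(points, 2, 10)   // k = 2, maxIterations = 10

val pmml: String = model.toPMML()         // PMML document as a String
model.toPMML("/tmp/kmeans.pmml")          // or write to a local path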

There is an external project for PMML too (note licensing) -
https://github.com/jpmml/jpmml-sparkml - which is by now actually quite
comprehensive. It shows that PMML can represent a pretty large subset of
typical ML pipeline functionality.

On the Python side sadly there is even less - I would say your options are
pretty much "roll your own" currently, or export in PMML or PFA.

Finally, part of the "mllib-local" idea was around enabling this local
model-serving (for some initial discussion about the future see
https://issues.apache.org/jira/browse/SPARK-16365).

N

On Thu, 11 Aug 2016 at 06:28 Michael Allman  wrote:

> Nick,
>
> Check out MLeap: https://github.com/TrueCar/mleap. It's not python, but
> we use it in production to serve a random forest model trained by a Spark
> ML pipeline.
>
> Thanks,
>
> Michael
>
> On Aug 10, 2016, at 7:50 PM, Nicholas Chammas 
> wrote:
>
> Are there any existing JIRAs covering the possibility of serving up Spark
> ML models via, for example, a regular Python web app?
>
> The story goes like this: You train your model with Spark on several TB of
> data, and now you want to use it in a prediction service that you’re
> building, say with Flask . In principle, you
> don’t need Spark anymore since you’re just passing individual data points
> to your model and looking for it to spit some prediction back.
>
> I assume this is something people do today, right? I presume Spark needs
> to run in their web service to serve up the model. (Sorry, I’m new to the
> ML side of Spark. )
>
> Are there any JIRAs discussing potential improvements to this story? I did
> a search, but I’m not sure what exactly to look for. SPARK-4587
>  (model import/export)
> looks relevant, but doesn’t address the story directly.
>
> Nick
> ​
>
>
>


Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating
TF-IDF normalized vectors. The definition (
https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency
in TF-IDF is actually the "number of times the term occurs in the document".

So it's perhaps a bit of a misnomer, but the implementation is correct.
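A quick sketch to illustrate (assuming a SparkSession `spark` and
`import spark.implicits._` in scope): the hashed "TF" vector holds raw
per-document term counts, and IDF then scales those counts - nothing is divided
by the document length:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val docs = Seq((0, "a a b"), (1, "b c")).toDF("id", "text")
val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val tf = new HashingTF().setInputCol("words").setOutputCol("rawTF")
  .setNumFeatures(16).transform(words)

val idfModel = new IDF().setInputCol("rawTF").setOutputCol("tfidf").fit(tf)
idfModel.transform(tf).select("rawTF", "tfidf").show(false)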

On Tue, 2 Aug 2016 at 05:44 Yanbo Liang  wrote:

> Hi Hao,
>
> HashingTF directly applies a hash function (MurmurHash3) to the features to
> determine their column index, without taking into account the term
> frequency or the length of the document. It does similar work to
> sklearn FeatureHasher. The result is increased speed and reduced memory
> usage, but it does not remember what the input features looked like and can
> not convert the output back to the original features. Actually we misnamed
> this transformer, it only does the work of feature hashing rather than
> computing hashing term frequency.
>
> CountVectorizer will select the top vocabSize words ordered by term
> frequency across the corpus to build the hash table of the features. So it
> will consume more memory than HashingTF. However, we can convert the output
> back to the original feature.
>
> Both of the transformers do not consider the length of each document. If
> you want to compute term frequency divided by the length of the document,
> you should write your own function based on transformers provided by MLlib.
>
> Thanks
> Yanbo
>
> 2016-08-01 15:29 GMT-07:00 Hao Ren :
>
>> When computing term frequency, we can use either HashTF or
>> CountVectorizer feature extractors.
>> However, both of them just use the number of times that a term appears in
>> a document.
>> It is not a true frequency. Actually, it should be divided by the length
>> of the document.
>>
>> Is this a wanted feature ?
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>


Re: Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Nick Pentreath
+1 I don't believe there's any reason for the warnings to still be there
except for available dev time & focus :)

On Wed, 27 Jul 2016 at 21:35, Jacek Laskowski  wrote:

> Kill 'em all -- one by one slowly yet gradually! :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Jul 27, 2016 at 9:11 PM, Holden Karau 
> wrote:
> > Now that the 2.0 release is out the door and I've got some cycles to do
> some
> > cleanups -  I'd like to know what other people think of the internal
> > deprecation warnings we've introduced in a lot of a places in our code.
> Once
> > before I did some minor refactoring so the Python code which had to use
> the
> > deprecated code to expose the deprecated API wouldn't gum up the build
> logs
> > - but is there interest in doing that or are we more interested in not
> > paying attention to the deprecation warnings for internal Spark
> components
> > (e.g. https://twitter.com/thepracticaldev/status/725769766603001856 )?
> >
> >
> > --
> > Cell : 425-233-8271
> > Twitter: https://twitter.com/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-07-04 Thread Nick Pentreath
Hey everyone,

Is there an updated timeline for cutting the next RC? Do we have a
clear picture of outstanding issues? I see 21 issues marked Blocker or
Critical targeted at 2.0.0.

The only blockers I see on JIRA are related to MLlib doc updates etc (I
will go through a few of these to clean them up and see where they stand).
If there are other blockers then we should mark them as such to help
track progress.


On Tue, 28 Jun 2016 at 11:28 Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> I take it there will be another RC due to some blockers and as there were
> no +1 votes anyway.
>
> FWIW, I cannot run python tests using "./python/run-tests".
>
> I'd be -1 for this reason (see https://github.com/apache/spark/pull/13737 /
> http://issues.apache.org/jira/browse/SPARK-15954) - does anyone else
> encounter this?
>
> ./python/run-tests --python-executables=python2.7
> Running PySpark tests. Output is in
> /Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/unit-tests.log
> Will test against the following Python executables: ['python2.7']
> Will test the following Python modules: ['pyspark-core', 'pyspark-ml',
> 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> ==
> ERROR: setUpClass (pyspark.sql.tests.HiveContextSQLTests)
> --
> Traceback (most recent call last):
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/tests.py",
> line 1620, in setUpClass
> cls.spark = HiveContext._createForTesting(cls.sc)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/context.py",
> line 490, in _createForTesting
> jtestHive =
> sparkContext._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 1183, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
> line 312, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling
> None.org.apache.spark.sql.hive.test.TestHiveContext.
> : java.lang.NullPointerException
> at
> org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:183)
> at
> org.apache.spark.sql.hive.test.TestHiveSparkSession.(TestHive.scala:214)
> at
> org.apache.spark.sql.hive.test.TestHiveSparkSession.(TestHive.scala:122)
> at org.apache.spark.sql.hive.test.TestHiveContext.(TestHive.scala:77)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
>
>
> ==
> ERROR: setUpClass (pyspark.sql.tests.SQLTests)
> --
> Traceback (most recent call last):
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/tests.py",
> line 189, in setUpClass
> ReusedPySparkTestCase.setUpClass()
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/tests.py",
> line 344, in setUpClass
> cls.sc = SparkContext('local[4]', cls.__name__)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/context.py",
> line 112, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File
> "/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/context.py",
> line 261, in _ensure_initialized
> callsite.function, callsite.file, callsite.linenum))
> ValueError: Cannot run multiple SparkCont

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-28 Thread Nick Pentreath
I take it there will be another RC due to some blockers and as there were
no +1 votes anyway.

FWIW, I cannot run python tests using "./python/run-tests".

I'd be -1 for this reason (see https://github.com/apache/spark/pull/13737 /
http://issues.apache.org/jira/browse/SPARK-15954) - does anyone else
encounter this?

./python/run-tests --python-executables=python2.7
Running PySpark tests. Output is in
/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml',
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
==
ERROR: setUpClass (pyspark.sql.tests.HiveContextSQLTests)
--
Traceback (most recent call last):
  File
"/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/tests.py",
line 1620, in setUpClass
cls.spark = HiveContext._createForTesting(cls.sc)
  File
"/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/context.py",
line 490, in _createForTesting
jtestHive =
sparkContext._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc)
  File
"/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
line 1183, in __call__
answer, self._gateway_client, None, self._fqn)
  File
"/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
line 312, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling
None.org.apache.spark.sql.hive.test.TestHiveContext.
: java.lang.NullPointerException
at
org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:183)
at
org.apache.spark.sql.hive.test.TestHiveSparkSession.(TestHive.scala:214)
at
org.apache.spark.sql.hive.test.TestHiveSparkSession.(TestHive.scala:122)
at org.apache.spark.sql.hive.test.TestHiveContext.(TestHive.scala:77)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)


==
ERROR: setUpClass (pyspark.sql.tests.SQLTests)
--
Traceback (most recent call last):
  File
"/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/sql/tests.py",
line 189, in setUpClass
ReusedPySparkTestCase.setUpClass()
  File
"/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/tests.py",
line 344, in setUpClass
cls.sc = SparkContext('local[4]', cls.__name__)
  File
"/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/context.py",
line 112, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
  File
"/Users/nick/workspace/scala/spark-rcs/spark-2.0.0/python/pyspark/context.py",
line 261, in _ensure_initialized
callsite.function, callsite.file, callsite.linenum))
ValueError: Cannot run multiple SparkContexts at once; existing
SparkContext(app=ReusedPySparkTestCase, master=local[4]) created by
 at /Users/nick/miniconda2/lib/python2.7/runpy.py:72

--
Ran 4 tests in 4.800s

FAILED (errors=2)

Had test failures in pyspark.sql.tests with python2.7; see logs.


On Mon, 27 Jun 2016 at 20:13 Egor Pahomov <pahomov.e...@gmail.com> wrote:

> -1 : SPARK-16228 [SQL]  - "Percentile" needs explicit cast to double,
> otherwise it throws an error. I cannot move my existing 100500 queries to
> 2.0 transparently.
>
> 2016-06-24 11:52 GMT-07:00 Matt Cheah <mch...@palantir.com>:
>
>> -1 because of SPARK-16181 which is a correctness regression from 1.6.
>> Looks like the patch is ready though:
>> https://github.com/apache/spark/pull/13884 – it would be ideal for this
>> patch to make it into the release.
>>
>&

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-24 Thread Nick Pentreath
I'm getting the following when trying to run ./dev/run-tests (not happening
on master) from the extracted source tar. Anyone else seeing this?

error: Could not access 'fc0a1475ef'
**
File "./dev/run-tests.py", line 69, in
__main__.identify_changed_files_from_git_commits
Failed example:
[x.name for x in determine_modules_for_files(
identify_changed_files_from_git_commits("fc0a1475ef",
target_ref="5da21f07"))]
Exception raised:
Traceback (most recent call last):
  File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315, in
__run
compileflags, 1) in test.globs
  File "",
line 1, in 
[x.name for x in determine_modules_for_files(
identify_changed_files_from_git_commits("fc0a1475ef",
target_ref="5da21f07"))]
  File "./dev/run-tests.py", line 86, in
identify_changed_files_from_git_commits
universal_newlines=True)
  File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573,
in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'diff', '--name-only',
'fc0a1475ef', '5da21f07']' returned non-zero exit status 1
error: Could not access '50a0496a43'
**
File "./dev/run-tests.py", line 71, in
__main__.identify_changed_files_from_git_commits
Failed example:
'root' in [x.name for x in determine_modules_for_files(
 identify_changed_files_from_git_commits("50a0496a43",
target_ref="6765ef9"))]
Exception raised:
Traceback (most recent call last):
  File "/Users/nick/miniconda2/lib/python2.7/doctest.py", line 1315, in
__run
compileflags, 1) in test.globs
  File "",
line 1, in 
'root' in [x.name for x in determine_modules_for_files(
 identify_changed_files_from_git_commits("50a0496a43",
target_ref="6765ef9"))]
  File "./dev/run-tests.py", line 86, in
identify_changed_files_from_git_commits
universal_newlines=True)
  File "/Users/nick/miniconda2/lib/python2.7/subprocess.py", line 573,
in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['git', 'diff', '--name-only',
'50a0496a43', '6765ef9']' returned non-zero exit status 1
**
1 items had failures:
   2 of   2 in __main__.identify_changed_files_from_git_commits
***Test Failed*** 2 failures.



On Fri, 24 Jun 2016 at 06:59 Yin Huai  wrote:

> -1 because of https://issues.apache.org/jira/browse/SPARK-16121.
>
> This jira was resolved after 2.0.0-RC1 was cut. Without the fix, Spark
> SQL effectively only uses the driver to list files when loading datasets
> and the driver-side file listing is very slow for datasets having many
> files and partitions. Since this bug causes a serious performance
> regression, I am giving -1.
>
> On Thu, Jun 23, 2016 at 1:25 AM, Pete Robbins  wrote:
>
>> I'm also seeing some of these same failures:
>>
>> - spilling with compression *** FAILED ***
>> I have seen this occassionaly
>>
>> - to UTC timestamp *** FAILED ***
>> This was fixed yesterday in branch-2.0 (
>> https://issues.apache.org/jira/browse/SPARK-16078)
>>
>> - offset recovery *** FAILED ***
>> Haven't seen this for a while and thought the flaky test was fixed but it
>> popped up again in one of our builds.
>>
>> StateStoreSuite:
>> - maintenance *** FAILED ***
>> Just seen this has been failing for last 2 days on one build machine
>> (linux amd64)
>>
>>
>> On 23 June 2016 at 08:51, Sean Owen  wrote:
>>
>>> First pass of feedback on the RC: all the sigs, hashes, etc are fine.
>>> Licensing is up to date to the best of my knowledge.
>>>
>>> I'm hitting test failures, some of which may be spurious. Just putting
>>> them out there to see if they ring bells. This is Java 8 on Ubuntu 16.
>>>
>>>
>>> - spilling with compression *** FAILED ***
>>>   java.lang.Exception: Test failed with compression using codec
>>> org.apache.spark.io.SnappyCompressionCodec:
>>> assertion failed: expected cogroup to spill, but did not
>>>   at scala.Predef$.assert(Predef.scala:170)
>>>   at org.apache.spark.TestUtils$.assertSpilled(TestUtils.scala:170)
>>>   at org.apache.spark.util.collection.ExternalAppendOnlyMapSuite.org
>>> $apache$spark$util$collection$ExternalAppendOnlyMapSuite$$testSimpleSpilling(ExternalAppendOnlyMapSuite.scala:263)
>>> ...
>>>
>>> I feel like I've seen this before, and see some possibly relevant
>>> fixes, but they're in 2.0.0 already:
>>> https://github.com/apache/spark/pull/10990
>>> Is this something where a native library needs to be installed or
>>> something?
>>>
>>>
>>> - to UTC timestamp *** FAILED ***
>>>   "2016-03-13 [02]:00:00.0" did not equal "2016-03-13 [10]:00:00.0"
>>> (DateTimeUtilsSuite.scala:506)
>>>
>>> I know, we talked about this for the 1.6.2 RC, but I reproduced this

Re: Welcoming Yanbo Liang as a committer

2016-06-04 Thread Nick Pentreath
Congratulations Yanbo and welcome
On Sat, 4 Jun 2016 at 10:17, Hortonworks  wrote:

> Congratulations, Yanbo
>
> Zhan Zhang
>
> Sent from my iPhone
>
> > On Jun 3, 2016, at 8:39 PM, Dongjoon Hyun  wrote:
> >
> > Congratulations
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Cannot build master with sbt

2016-05-25 Thread Nick Pentreath
I've filed https://issues.apache.org/jira/browse/SPARK-15525

For now, you would have to check out sbt-antlr4 at
https://github.com/ihji/sbt-antlr4/commit/23eab68b392681a7a09f6766850785afe8dfa53d
(since
I don't see any branches or tags in the github repo for different
versions), and sbt publishLocal to get the dependency locally.

On Wed, 25 May 2016 at 15:13 Yiannis Gkoufas  wrote:

> Hi there,
>
> I have cloned the latest version from github.
> I am using scala 2.10.x
> When I invoke
>
> build/sbt clean package
>
> I get exceptions because of the sbt-antlr4 library:
>
> [warn] module not found: com.simplytyped#sbt-antlr4;0.7.10
> [warn]  typesafe-ivy-releases: tried
> [warn]
> https://repo.typesafe.com/typesafe/ivy-releases/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
> [warn]  sbt-plugin-releases: tried
> [warn]
> https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
> [warn]  local: tried
> [warn]
> /home/johngouf/.ivy2/local/com.simplytyped/sbt-antlr4/scala_2.10/sbt_0.13/0.7.10/ivys/ivy.xml
> [warn]  public: tried
> [warn]
> https://repo1.maven.org/maven2/com/simplytyped/sbt-antlr4_2.10_0.13/0.7.10/sbt-antlr4-0.7.10.pom
> [warn]  simplytyped: tried
> [warn]
> http://simplytyped.github.io/repo/releases/com/simplytyped/sbt-antlr4_2.10_0.13/0.7.10/sbt-antlr4-0.7.10.pom
> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
> [warn] ::
> [warn] ::  UNRESOLVED DEPENDENCIES ::
> [warn] ::
> [warn] :: com.simplytyped#sbt-antlr4;0.7.10: not found
> [warn] ::
> [warn]
> [warn] Note: Some unresolved dependencies have extra attributes.
> Check that these dependencies exist with the requested attributes.
> [warn] com.simplytyped:sbt-antlr4:0.7.10 (scalaVersion=2.10,
> sbtVersion=0.13)
> [warn]
> [warn] Note: Unresolved dependencies path:
> [warn] com.simplytyped:sbt-antlr4:0.7.10 (scalaVersion=2.10,
> sbtVersion=0.13) (/home/johngouf/IOT/spark/project/plugins.sbt#L26-27)
> [warn]   +- plugins:plugins:0.1-SNAPSHOT (scalaVersion=2.10,
> sbtVersion=0.13)
> sbt.ResolveException: unresolved dependency:
> com.simplytyped#sbt-antlr4;0.7.10: not found
>
> Any idea what is the problem here?
>
> Thanks!
>


Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Nick Pentreath
+1 (binding)
On Mon, 23 May 2016 at 04:19, Matei Zaharia  wrote:

> Correction, let's run this for 72 hours, so until 9 PM EST May 25th.
>
> > On May 22, 2016, at 8:34 PM, Matei Zaharia 
> wrote:
> >
> > It looks like the discussion thread on this has only had positive
> replies, so I'm going to call a VOTE. The proposal is to remove the
> maintainer process in
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers
> <
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers>
> given that it doesn't seem to have had a huge impact on the project, and it
> can unnecessarily create friction in contributing. We already have +1s from
> Mridul, Tom, Andrew Or and Imran on that thread.
> >
> > I'll leave the VOTE open for 48 hours, until 9 PM EST on May 24, 2016.
> >
> > Matei
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Cross Validator to work with K-Fold value of 1?

2016-05-02 Thread Nick Pentreath
There is a JIRA and PR around for supporting polynomial expansion with
degree 1. Offhand I can't recall if it's been merged
On Mon, 2 May 2016 at 17:45, Julio Antonio Soto de Vicente 
wrote:

> Hi,
>
> Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It
> would be nice to cross-validate with degree 1 polynomial expansion (that
> is, with no expansion at all) vs other degree polynomial expansions.
> Unfortunately, degree is forced to be >= 2.
>
> --
> Julio
>
> > El 2 may 2016, a las 9:05, Rahul Tanwani 
> escribió:
> >
> > Hi,
> >
> > In certain cases (mostly due to time constraints), we need some model to
> run
> > without cross validation. In such a case, since k-fold value for cross
> > validator cannot be one, we have to maintain two different code paths to
> > achieve both scenarios (with and without cross validation).
> >
> > Would it be an okay idea to generalize the cross validator so it can work
> > with k-fold value of 1? The only purpose for this is to avoid maintaining
> > two different code paths and in functionality it should be similar to as
> if
> > the cross validation is not present.
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Cross-Validator-to-work-with-K-Fold-value-of-1-tp17404.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Nick Pentreath
You should find that the first set of fits is run on the training set,
and the resulting models are evaluated on the validation set. The final best
model is then retrained on the entire dataset. This is standard practice -
usually the dataset passed to the train validation split is itself further
split into a training and test set, where the final best model is evaluated
against the test set.
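To illustrate, a sketch only (`training` is an assumed DataFrame with
"features" and "label" columns):

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val lr = new LinearRegression()
val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()

// fit() splits `training` internally by trainRatio, evaluates each param
// combination on the validation part, then refits the best combination on
// all of `training`
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(grid)
  .setTrainRatio(0.8)
val bestModel = tvs.fit(training)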
On Wed, 27 Apr 2016 at 14:30, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> Hi guys, I was testing a pipeline here and found a possible duplicated
> call to the fit method in the
> org.apache.spark.ml.tuning.TrainValidationSplit
> class.
> In line 110 there is a call to est.fit that fits all
> parameter combinations that we have set up.
> Down in line 128, after discovering which is the best model, we call
> fit again using the bestIndex; wouldn't it be better to just access the
> result of the already-called fit stored in the models val?
>
> Kind regards,
> Dirceu
>


Organizing Spark ML example packages

2016-04-14 Thread Nick Pentreath
Hey Spark devs

I noticed that we now have a large number of examples for ML & MLlib in the
examples project - 57 for ML and 67 for MLLIB to be precise. This is bound
to get larger as we add features (though I know there are some PRs to clean
up duplicated examples).

What do you think about organizing them into packages to match the use case
and the structure of the code base? e.g.

org.apache.spark.examples.ml.recommendation

org.apache.spark.examples.ml.feature

and so on...

Is it worth doing? The doc pages with include_example would need updating,
and the run_example script input would just need to change the package
slightly. Did I miss any potential issue?

N


Re: ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Ah, I got it - Seq[(Int, Double)] is actually represented as Seq[Row] (seq of
struct type) internally.

So a further extraction is required, e.g. row => row.getSeq[Row](1).map { r
=> r.getInt(0) }
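Applied to the example DF in the quoted message below, the full extraction
would look something like:

import org.apache.spark.sql.Row

// each array element comes back as a Row (struct), so extract the tuple fields explicitly
val extracted = df.rdd.map { row =>
  row.getSeq[Row](1).map(r => (r.getInt(0), r.getDouble(1)))
}.collect()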

On Wed, 6 Apr 2016 at 13:35 Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Hi there,
>
> In writing some tests for a PR I'm working on, with a more complex array
> type in a DF, I ran into this issue (running off latest master).
>
> Any thoughts?
>
> *// create DF with a column of Array[(Int, Double)]*
> val df = sc.parallelize(Seq(
> (0, Array((1, 6.0), (1, 4.0))),
> (1, Array((1, 3.0), (2, 1.0))),
> (2, Array((3, 3.0), (4, 6.0)))
> )).toDF("id", "predictions")
>
> *// extract the field from the Row, and use map to extract first element
> of tuple*
> *// the type of RDD appears correct*
> scala> df.rdd.map { row => row.getSeq[(Int, Double)](1).map(_._1) }
> res14: org.apache.spark.rdd.RDD[Seq[Int]] = MapPartitionsRDD[32] at map at
> :27
>
> *// however, calling collect on the same expression throws
> ClassCastException*
> scala> df.rdd.map { row => row.getSeq[(Int, Double)](1).map(_._1) }.collect
> 16/04/06 13:02:49 ERROR Executor: Exception in task 5.0 in stage 10.0 (TID
> 74)
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
> cast to scala.Tuple2
> at
> $line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(:27)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at
> $line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:27)
> at
> $line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:27)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
> at scala.collection.AbstractIterator.to(Iterator.scala:1194)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
> at
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:880)
> at
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:880)
>
> *// can collect the extracted field*
> *// again, return type appears correct*
> scala> df.rdd.map { row => row.getSeq[(Int, Double)](1) }.collect
> res23: Array[Seq[(Int, Double)]] = Array(WrappedArray([1,6.0], [1,4.0]),
> WrappedArray([1,3.0], [2,1.0]), WrappedArray([3,3.0], [4,6.0]))
>
> *// trying to apply map to extract first element of tuple fails*
> scala> df.rdd.map { row => row.getSeq[(Int, Double)](1)
> }.collect.map(_.map(_._1))
> java.lang.ClassCastException:
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
> cast to scala.Tuple2
>   at $anonfun$2$$anonfun$apply$1.apply(:27)
>   at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at $anonfun$2.apply(:27)
>   at $anonfun$2.apply(:27)
>   at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>


ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Hi there,

In writing some tests for a PR I'm working on, with a more complex array
type in a DF, I ran into this issue (running off latest master).

Any thoughts?

*// create DF with a column of Array[(Int, Double)]*
val df = sc.parallelize(Seq(
(0, Array((1, 6.0), (1, 4.0))),
(1, Array((1, 3.0), (2, 1.0))),
(2, Array((3, 3.0), (4, 6.0)))
)).toDF("id", "predictions")

*// extract the field from the Row, and use map to extract first element of
tuple*
*// the type of RDD appears correct*
scala> df.rdd.map { row => row.getSeq[(Int, Double)](1).map(_._1) }
res14: org.apache.spark.rdd.RDD[Seq[Int]] = MapPartitionsRDD[32] at map at
:27

*// however, calling collect on the same expression throws
ClassCastException*
scala> df.rdd.map { row => row.getSeq[(Int, Double)](1).map(_._1) }.collect
16/04/06 13:02:49 ERROR Executor: Exception in task 5.0 in stage 10.0 (TID
74)
java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
cast to scala.Tuple2
at
$line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(:27)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at
$line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:27)
at
$line54.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:27)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:880)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:880)

*// can collect the extracted field*
*// again, return type appears correct*
scala> df.rdd.map { row => row.getSeq[(Int, Double)](1) }.collect
res23: Array[Seq[(Int, Double)]] = Array(WrappedArray([1,6.0], [1,4.0]),
WrappedArray([1,3.0], [2,1.0]), WrappedArray([3,3.0], [4,6.0]))

*// trying to apply map to extract first element of tuple fails*
scala> df.rdd.map { row => row.getSeq[(Int, Double)](1)
}.collect.map(_.map(_._1))
java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
cast to scala.Tuple2
  at $anonfun$2$$anonfun$apply$1.apply(:27)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at $anonfun$2.apply(:27)
  at $anonfun$2.apply(:27)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
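
For reference, a workaround that avoids the cast (a sketch only): structs inside an
ArrayType column come back as Rows rather than Tuple2 at runtime, so the fields can
be read through the Row accessors instead:

import org.apache.spark.sql.Row

// read the first struct field of each array element
val firstElems = df.rdd.map { row =>
  row.getSeq[Row](1).map(_.getInt(0))
}.collect()

// or rebuild the tuples explicitly
val tuples = df.rdd.map { row =>
  row.getSeq[Row](1).map(r => (r.getInt(0), r.getDouble(1)))
}.collect()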


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Nick Pentreath
+1 for this proposal - as you mention I think it's the defacto current
situation anyway.

Note that from a developer view it's just the user-facing API that will be
only "ml" - the majority of the actual algorithms still operate on RDDs
under the hood currently.
On Wed, 6 Apr 2016 at 05:03, Chris Fregly  wrote:

> perhaps renaming to Spark ML would actually clear up code and
> documentation confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin  wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley 
> wrote:
>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
>> wrote:
>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally
>>> move towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 Overall this sounds good to me. One question I have is that in
 addition to the ML algorithms we have a number of linear algebra
 (various distributed matrices) and statistical methods in the
 spark.mllib package. Is the plan to port or move these to the spark.ml
 namespace in the 2.x series ?

 Thanks
 Shivaram

 On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
 > FWIW, all of that sounds like a good plan to me. Developing one API is
 > certainly better than two.
 >
 > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng 
 wrote:
 >> Hi all,
 >>
 >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
 built
 >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
 API has
 >> been developed under the spark.ml package, while the old RDD-based
 API has
 >> been developed in parallel under the spark.mllib package. While it
 was
 >> easier to implement and experiment with new APIs under a new
 package, it
 >> became harder and harder to maintain as both packages grew bigger and
 >> bigger. And new users are often confused by having two sets of APIs
 with
 >> overlapped functions.
 >>
 >> We started to recommend the DataFrame-based API over the RDD-based
 API in
 >> Spark 1.5 for its versatility and flexibility, and we saw the
 development
 >> and the usage gradually shifting to the DataFrame-based API. Just
 counting
 >> the lines of Scala code, from 1.5 to the current master we added
 ~1
 >> lines to the DataFrame-based API while ~700 to the RDD-based API.
 So, to
 >> gather more resources on the development of the DataFrame-based API
 and to
 >> help users migrate over sooner, I want to propose switching
 RDD-based MLlib
 >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
 >>
 >> * We do not accept new features in the RDD-based spark.mllib
 package, unless
 >> they block implementing new features in the DataFrame-based spark.ml
 >> package.
 >> * We still accept bug fixes in the RDD-based API.
 >> * We will add more features to the DataFrame-based API in the 2.x
 series to
 >> reach feature parity with the RDD-based API.
 >> * Once we reach feature parity (possibly in Spark 2.2), we will
 deprecate
 >> the RDD-based API.
 >> * We will remove the RDD-based API from the main Spark repo in Spark
 3.0.
 >>
 >> Though the RDD-based API is already in de facto maintenance mode,
 this
 >> announcement will make it clear and hence important to both MLlib
 developers
 >> and users. So we’d greatly appreciate your feedback!
 >>
 >> (As a side note, people sometimes use “Spark ML” to refer to the
 >> DataFrame-based API or even the entire MLlib component. This also
 causes
 >> confusion. To be clear, “Spark ML” is not an official name and there
 are no
 >> plans to rename MLlib to “Spark ML” at this time.)
 >>
 >> Best,
 >> Xiangrui
 >
 > -
 > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 > For additional commands, e-mail: user-h...@spark.apache.org
 >

>>>
>>>

Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Nick Pentreath
No, I didn't yet - feel free to create a JIRA.



On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegm...@teamaol.com>
wrote:

> Hi Nick,
>
> Thanks again for your help with this. Did you create a ticket in JIRA for
> investigating sparse models in LR and / or multivariate summariser? If so,
> can you give me the issue key(s)? If not, would you like me to create these
> tickets?
>
> I'm going to look into this some more and see if I can figure out how to
> implement these fixes.
>
> ~Daniel Siegmann
>
> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Also adding dev list in case anyone else has ideas / views.
>>
>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>
>>> Thanks for the feedback.
>>>
>>> I think Spark can certainly meet your use case when your data size
>>> scales up, as the actual model dimension is very small - you will need to
>>> use those indexers or some other mapping mechanism.
>>>
>>> There is ongoing work for Spark 2.0 to make it easier to use models
>>> outside of Spark - also see PMML export (I think mllib logistic regression
>>> is supported but I have to check that). That will help use spark models in
>>> serving environments.
>>>
>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>>> also a ticket for multivariate summariser (though I don't think in practice
>>> there will be much to gain).
>>>
>>>
>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>>> daniel.siegm...@teamaol.com> wrote:
>>>
>>>> Thanks for the pointer to those indexers, those are some good examples.
>>>> A good way to go for the trainer and any scoring done in Spark. I will
>>>> definitely have to deal with scoring in non-Spark systems though.
>>>>
>>>> I think I will need to scale up beyond what single-node liblinear can
>>>> practically provide. The system will need to handle much larger sub-samples
>>>> of this data (and other projects might be larger still). Additionally, the
>>>> system needs to train many models in parallel (hyper-parameter optimization
>>>> with n-fold cross-validation, multiple algorithms, different sets of
>>>> features).
>>>>
>>>> Still, I suppose we'll have to consider whether Spark is the best
>>>> system for this. For now though, my job is to see what can be achieved with
>>>> Spark.
>>>>
>>>>
>>>>
>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Ok, I think I understand things better now.
>>>>>
>>>>> For Spark's current implementation, you would need to map those
>>>>> features as you mention. You could also use say StringIndexer ->
>>>>> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with
>>>>> the mapping and training (e.g.
>>>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>>>> Pipeline supports persistence.
>>>>>
>>>>> But it depends on your scoring use case too - a Spark pipeline can be
>>>>> saved and then reloaded, but you need all of Spark dependencies in your
>>>>> serving app which is often not ideal. If you're doing bulk scoring 
>>>>> offline,
>>>>> then it may suit.
>>>>>
>>>>> Honestly though, for that data size I'd certainly go with something
>>>>> like Liblinear :) Spark will ultimately scale better with # training
>>>>> examples for very large scale problems. However there are definitely
>>>>> limitations on model dimension and sparse weight vectors currently. There
>>>>> are potential solutions to these but they haven't been implemented as yet.
>>>>>
>>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>>>> daniel.siegm...@teamaol.com> wrote:
>>>>>
>>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>>>> nick.pentre...@gmail.com> wrote:
>>>>>>
>>>>>>> Would you mind letting us know the # training examples in the
>>>>>>> datasets? Also, what do your features look like? Are they text, 
>>>>>>> categorical
>>>>>>> etc? You mention that mo
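
(A minimal sketch of the Pipeline approach mentioned above - StringIndexer ->
OneHotEncoder -> LogisticRegression - with hypothetical column names:)

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer().setInputCol("thingId").setOutputCol("thingIndex")
val encoder = new OneHotEncoder().setInputCol("thingIndex").setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
val model = pipeline.fit(trainingDF)               // trainingDF: the training DataFrame
model.write.overwrite().save("/tmp/lr-pipeline")   // pipeline persistence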

Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Also adding dev list in case anyone else has ideas / views.

On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Thanks for the feedback.
>
> I think Spark can certainly meet your use case when your data size scales
> up, as the actual model dimension is very small - you will need to use
> those indexers or some other mapping mechanism.
>
> There is ongoing work for Spark 2.0 to make it easier to use models
> outside of Spark - also see PMML export (I think mllib logistic regression
> is supported but I have to check that). That will help use spark models in
> serving environments.
>
> Finally, I will add a JIRA to investigate sparse models for LR - maybe
> also a ticket for multivariate summariser (though I don't think in practice
> there will be much to gain).
>
>
> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <daniel.siegm...@teamaol.com>
> wrote:
>
>> Thanks for the pointer to those indexers, those are some good examples. A
>> good way to go for the trainer and any scoring done in Spark. I will
>> definitely have to deal with scoring in non-Spark systems though.
>>
>> I think I will need to scale up beyond what single-node liblinear can
>> practically provide. The system will need to handle much larger sub-samples
>> of this data (and other projects might be larger still). Additionally, the
>> system needs to train many models in parallel (hyper-parameter optimization
>> with n-fold cross-validation, multiple algorithms, different sets of
>> features).
>>
>> Still, I suppose we'll have to consider whether Spark is the best system
>> for this. For now though, my job is to see what can be achieved with Spark.
>>
>>
>>
>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> Ok, I think I understand things better now.
>>>
>>> For Spark's current implementation, you would need to map those features
>>> as you mention. You could also use say StringIndexer -> OneHotEncoder or
>>> VectorIndexer. You could create a Pipeline to deal with the mapping and
>>> training (e.g.
>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>> Pipeline supports persistence.
>>>
>>> But it depends on your scoring use case too - a Spark pipeline can be
>>> saved and then reloaded, but you need all of Spark dependencies in your
>>> serving app which is often not ideal. If you're doing bulk scoring offline,
>>> then it may suit.
>>>
>>> Honestly though, for that data size I'd certainly go with something like
>>> Liblinear :) Spark will ultimately scale better with # training examples
>>> for very large scale problems. However there are definitely limitations on
>>> model dimension and sparse weight vectors currently. There are potential
>>> solutions to these but they haven't been implemented as yet.
>>>
>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>> daniel.siegm...@teamaol.com> wrote:
>>>
>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Would you mind letting us know the # training examples in the
>>>>> datasets? Also, what do your features look like? Are they text, 
>>>>> categorical
>>>>> etc? You mention that most rows only have a few features, and all rows
>>>>> together have a few 10,000s features, yet your max feature value is 20
>>>>> million. How are your constructing your feature vectors to get a 20 
>>>>> million
>>>>> size? The only realistic way I can see this situation occurring in 
>>>>> practice
>>>>> is with feature hashing (HashingTF).
>>>>>
>>>>
>>>> The sub-sample I'm currently training on is about 50K rows, so ...
>>>> small.
>>>>
>>>> The features causing this issue are numeric (int) IDs for ... lets
>>>> call it "Thing". For each Thing in the record, we set the feature
>>>> Thing.id to a value of 1.0 in our vector (which is of course a
>>>> SparseVector). I'm not sure how IDs are generated for Things, but they
>>>> can be large numbers.
>>>>
>>>> The largest Thing ID is around 20 million, so that ends up being the
>>>> size of the vector. But in fact there are fewer than 10,000 unique Thing
>>>> IDs in this data. The mean number of features per record in what I'm
>>>> currently tr
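
(A sketch of the feature-hashing alternative touched on above, which bounds the
vector size regardless of how large the raw Thing IDs are; the column names are
hypothetical:)

import org.apache.spark.ml.feature.HashingTF

val hashingTF = new HashingTF()
  .setInputCol("things")       // a column of Seq[String] holding the Thing IDs per record
  .setOutputCol("features")
  .setNumFeatures(1 << 15)     // 32768 buckets is plenty for ~10,000 distinct IDs

val featurized = hashingTF.transform(df)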

Re: Running ALS on comparatively large RDD

2016-03-10 Thread Nick Pentreath
Could you provide more details about:
1. Data set size (# ratings, # users and # products)
2. Spark cluster set up and version

Thanks

On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan  wrote:

> Hello All,
>
> I've been running Spark's ALS on a dataset of users and rated items. I
> first encode my users to integers by using an auto increment function (
> just like zipWithIndex), I do the same for my items. I then create an RDD
> of the ratings and feed it to ALS.
>
> My issue is that the ALS algorithm never completes. Attached is a
> screenshot of the stages window.
>
> Any help will be greatly appreciated
>
> --
> Regards,
> *Deepak Gopalakrishnan*
> *Mobile*:+918891509774
> *Skype* : deepakgk87
> http://myexps.blogspot.com
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
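
(For anyone hitting the same symptom: one thing worth checking is whether lineage
growth across iterations is the culprit. A sketch of enabling checkpointing with the
RDD-based ALS - the paths, rank and iteration counts are placeholders:)

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

val ratings: RDD[Rating] = ???   // Rating(userInt, itemInt, rating), built as described above

val model = new ALS()
  .setRank(20)
  .setIterations(10)
  .setLambda(0.01)
  .setCheckpointInterval(5)      // truncate the lineage every few iterations
  .run(ratings)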


Re: ML ALS API

2016-03-08 Thread Nick Pentreath
Hi Maciej

Yes, that *train* method is intended to be public, but it is marked as
*DeveloperApi*, which means that backward compatibility is not necessarily
guaranteed, and that method may change. Having said that, even APIs marked
as DeveloperApi do tend to be relatively stable.

As the comment mentions:

 * :: DeveloperApi ::
 * An implementation of ALS that supports *generic ID types*, specialized
for Int and Long. This is
 * exposed as a developer API for users who do need other ID types. But it
is not recommended
 * because it increases the shuffle size and memory requirement during
training.

This *train* method is intended for the use case where user and item ids
are not the default Int (e.g. String). As you can see it returns the factor
RDDs directly, as opposed to an ALSModel instance, so overall it is a
little less user-friendly.

The *Float* ratings are to save space and make ALS more efficient overall.
That will not change in 2.0+ (especially since the precision of ratings is
not very important).

Hope that helps.
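
(For comparison, a minimal sketch of the standard estimator usage, which returns an
ALSModel rather than raw factor RDDs; the column names are assumptions:)

import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(10)
  .setMaxIter(10)
  .setRegParam(0.1)

val model = als.fit(ratingsDF)          // ALSModel, with userFactors / itemFactors DataFrames
val scored = model.transform(testDF)    // adds a "prediction" column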

On Tue, 8 Mar 2016 at 08:20 Maciej Szymkiewicz 
wrote:

> Can I ask for a clarifications regarding ml.recommendation.ALS:
>
> - is train method
> (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L598
> )
> intended to be be public?
> - Rating class
> (
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L436) is
> using float instead of double like its MLLib counterpart. Is it going to
> be a default encoding in 2.0+?
>
> --
> Best,
> Maciej Szymkiewicz
>
>
>


Re: Proposal

2016-01-30 Thread Nick Pentreath
Hi there

Sounds like a fun project :)

I'd recommend getting familiar with the existing k-means implementation as well 
as bisecting k-means in Spark, and then implementing yours based off that. You 
should focus on using the new ML pipelines API, and release it as a package on 
spark-packages.org.

If it gets a lot of use from there, it could be considered for inclusion in 
ML core in the future.

Good luck!

Sent from my iPhone
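
(For reference, a minimal sketch of how the existing spark.ml KMeans is driven through
the pipelines API - a k-modes estimator would ideally expose the same fit/transform
shape; the dataset and column names are assumptions:)

import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(3)
  .setFeaturesCol("features")
  .setPredictionCol("cluster")

val model = kmeans.fit(dataset)          // Estimator -> Model, like any pipeline stage
val clustered = model.transform(dataset) // appends the "cluster" column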

> On 31 Jan 2016, at 00:23, Acelot  wrote:
> 
> Hi All,
> As part of my final project at university I will try to build an alternative 
> version of the k-means algorithm called k-modes, introduced in the paper "Improving 
> the Accuracy and Efficiency of the k-means Clustering Algorithm" (Link: 
> http://www.iaeng.org/publication/WCE2009/WCE2009_pp308-312.pdf). I would like 
> to know about any related work. If anyone is interested in working on this project, 
> please contact me.
> Kind regards,
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Elasticsearch sink for metrics

2016-01-15 Thread Nick Pentreath
I haven't come across anything, but could you provide more detail on what
issues you're encountering?



On Fri, Jan 15, 2016 at 11:09 AM, Pete Robbins  wrote:

> Has anyone tried pushing Spark metrics into elasticsearch? We have other
> metrics, eg some runtime information, going into ES and would like to be
> able to combine this with the Spark metrics for visualization with Kibana.
>
> I experimented with a new sink using ES's ElasticsearchReporter for the
> Coda Hale metrics but have a few issues with default mappings.
>
> Has anyone already implemented this before I start to dig deeper?
>
> Cheers,
>
>
>


Re: Write access to wiki

2016-01-12 Thread Nick Pentreath
I'd also like to get Wiki write access - at the least it allows a few of us
to amend the "Powered By" and similar pages when those requests come
through (Sean has been doing a lot of that recently :)

On Mon, Jan 11, 2016 at 11:01 PM, Sean Owen  wrote:

> ... I forget who can give access -- is it INFRA at Apache or one of us?
> I can apply any edit you need in the meantime.
>
> Shane may be able to fill you in on how the Jenkins build is set up.
>
> On Mon, Jan 11, 2016 at 8:56 PM, Mark Grover  wrote:
> > Hi all,
> > May I please get write access to the useful tools wiki page?
> >
> > I did some investigation related to docker integration tests and want to
> > list out the pre-requisites required on the machine for those tests to
> pass,
> > on that page.
> >
> > On a related note, I was trying to search for any puppet recipes we
> maintain
> > for setting up build slaves. If our Jenkins infra were wiped out, how do
> we
> > rebuild the slave?
> >
> > Thanks in advance!
> >
> > Mark
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
cc'ing dev list

Ok, looks like when the KCL version was updated in
https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
probably leading to a dependency conflict, though as Burak mentions it's hard
to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
in driver or worker logs, so any exception is getting swallowed somewhere.

Run starting. Expected test count is: 4
KinesisStreamSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- KinesisUtils API
- RDD generation
- basic operation *** FAILED ***
  The code passed to eventually never returned normally. Attempted 13 times
over 2.04 minutes. Last failure message: Set() did not equal Set(5, 10,
1, 6, 9, 2, 7, 3, 8, 4)
  Data received does not match data sent. (KinesisStreamSuite.scala:188)
- failure recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 63 times
over 2.02863831 minutes. Last failure message: isCheckpointPresent
was true, but 0 was not greater than 10. (KinesisStreamSuite.scala:228)
Run completed in 5 minutes, 0 seconds.
Total number of tests run: 4
Suites: completed 1, aborted 0
Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***
[INFO]

[INFO] BUILD FAILURE
[INFO]



KCL 1.3.0 depends on *1.9.37* SDK (
https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
while the Spark Kinesis dependency was kept at *1.9.16.*

I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
1.9.37 and everything works.

Run starting. Expected test count is: 28
KinesisBackedBlockRDDSuite:
Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
Kinesis streams for tests.
- Basic reading from Kinesis
- Read data available in both block manager and Kinesis
- Read data available only in block manager, not in Kinesis
- Read data available only in Kinesis, not in block manager
- Read data available partially in block manager, rest in Kinesis
- Test isBlockValid skips block fetching from block manager
- Test whether RDD is valid after removing blocks from block manager
KinesisStreamSuite:
- KinesisUtils API
- RDD generation
- basic operation
- failure recovery
KinesisReceiverSuite:
- check serializability of SerializableAWSCredentials
- process records including store and checkpoint
- shouldn't store and checkpoint when receiver is stopped
- shouldn't checkpoint when exception occurs during store
- should set checkpoint time to currentTime + checkpoint interval upon
instantiation
- should checkpoint if we have exceeded the checkpoint interval
- shouldn't checkpoint if we have not exceeded the checkpoint interval
- should add to time when advancing checkpoint
- shutdown should checkpoint if the reason is TERMINATE
- shutdown should not checkpoint if the reason is something other than
TERMINATE
- retry success on first attempt
- retry success on second attempt after a Kinesis throttling exception
- retry success on second attempt after a Kinesis dependency exception
- retry failed after a shutdown exception
- retry failed after an invalid state exception
- retry failed after unexpected exception
- retry failed after exhausting all retries
Run completed in 3 minutes, 28 seconds.
Total number of tests run: 28
Suites: completed 4, aborted 0
Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

So this is a regression in Spark Streaming Kinesis 1.5.2 - @Brian can you
file a JIRA for this?

@dev-list, since KCL brings in AWS SDK dependencies itself, is it necessary
to declare an explicit dependency on aws-java-sdk in the Kinesis POM? Also,
from KCL 1.5.0+, only the relevant components used from the AWS SDKs are
brought in, making things a bit leaner (this can be upgraded in Spark
1.7/2.0 perhaps). All local tests (and integration tests) pass after
removing the explicit dependency and depending only on KCL. Is aws-java-sdk
used anywhere else (AFAIK it is not, but in case I missed something let me
know any good reason to keep the explicit dependency)?

N
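
(A sketch, in sbt syntax, of pinning the SDK to the version KCL 1.3.0 was built
against, which is the combination the passing runs above used:)

libraryDependencies ++= Seq(
  "com.amazonaws" % "amazon-kinesis-client" % "1.3.0",
  "com.amazonaws" % "aws-java-sdk"          % "1.9.37"
)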



On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Yeah also the integration tests need to be specifically run - I would have
> thought the contributor would have run those tests and also tested the
> change themselves using live Kinesis :(
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Dec 11, 2015 at 6:18 AM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> I don't think the Kinesis tests specifically ran when that was merged
>> into 1.5.2 :(
>> https://github.com/apache/spark/pull/8957
>>
>> https://github.com/apache/spark/commit/

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
Is that PR against master branch?




S3 read comes from Hadoop / jet3t afaik



—
Sent from Mailbox

On Fri, Dec 11, 2015 at 5:38 PM, Brian London <brianmlon...@gmail.com>
wrote:

> That's good news. I've got a PR in to bump the SDK version to 1.10.40 and the
> KCL to 1.6.1 which I'm running tests on locally now.
> Is the AWS SDK not used for reading/writing from S3 or do we get that for
> free from the Hadoop dependencies?
> On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>> cc'ing dev list
>>
>> Ok, looks like when the KCL version was updated in
>> https://github.com/apache/spark/pull/8957, the AWS SDK version was not,
>> probably leading to a dependency conflict, though as Burak mentions it's hard
>> to debug as no exceptions seem to get thrown... I've tested 1.5.2 locally
>> and on my 1.5.2 EC2 cluster, and no data is received, and nothing shows up
>> in driver or worker logs, so any exception is getting swallowed somewhere.
>>
>> Run starting. Expected test count is: 4
>> KinesisStreamSuite:
>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>> Kinesis streams for tests.
>> - KinesisUtils API
>> - RDD generation
>> - basic operation *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 13
>> times over 2.04 minutes. Last failure message: Set() did not equal
>> Set(5, 10, 1, 6, 9, 2, 7, 3, 8, 4)
>>   Data received does not match data sent. (KinesisStreamSuite.scala:188)
>> - failure recovery *** FAILED ***
>>   The code passed to eventually never returned normally. Attempted 63
>> times over 2.02863831 minutes. Last failure message:
>> isCheckpointPresent was true, but 0 was not greater than 10.
>> (KinesisStreamSuite.scala:228)
>> Run completed in 5 minutes, 0 seconds.
>> Total number of tests run: 4
>> Suites: completed 1, aborted 0
>> Tests: succeeded 2, failed 2, canceled 0, ignored 0, pending 0
>> *** 2 TESTS FAILED ***
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>>
>>
>> KCL 1.3.0 depends on *1.9.37* SDK (
>> https://github.com/awslabs/amazon-kinesis-client/blob/1.3.0/pom.xml#L26)
>> while the Spark Kinesis dependency was kept at *1.9.16.*
>>
>> I've run the integration tests on branch-1.5 (1.5.3-SNAPSHOT) with AWS SDK
>> 1.9.37 and everything works.
>>
>> Run starting. Expected test count is: 28
>> KinesisBackedBlockRDDSuite:
>> Using endpoint URL https://kinesis.eu-west-1.amazonaws.com for creating
>> Kinesis streams for tests.
>> - Basic reading from Kinesis
>> - Read data available in both block manager and Kinesis
>> - Read data available only in block manager, not in Kinesis
>> - Read data available only in Kinesis, not in block manager
>> - Read data available partially in block manager, rest in Kinesis
>> - Test isBlockValid skips block fetching from block manager
>> - Test whether RDD is valid after removing blocks from block manager
>> KinesisStreamSuite:
>> - KinesisUtils API
>> - RDD generation
>> - basic operation
>> - failure recovery
>> KinesisReceiverSuite:
>> - check serializability of SerializableAWSCredentials
>> - process records including store and checkpoint
>> - shouldn't store and checkpoint when receiver is stopped
>> - shouldn't checkpoint when exception occurs during store
>> - should set checkpoint time to currentTime + checkpoint interval upon
>> instantiation
>> - should checkpoint if we have exceeded the checkpoint interval
>> - shouldn't checkpoint if we have not exceeded the checkpoint interval
>> - should add to time when advancing checkpoint
>> - shutdown should checkpoint if the reason is TERMINATE
>> - shutdown should not checkpoint if the reason is something other than
>> TERMINATE
>> - retry success on first attempt
>> - retry success on second attempt after a Kinesis throttling exception
>> - retry success on second attempt after a Kinesis dependency exception
>> - retry failed after a shutdown exception
>> - retry failed after an invalid state exception
>> - retry failed after unexpected exception
>> - retry failed after exhausting all retries
>> Run completed in 3 minutes, 28 seconds.
>> Total number of tests run: 28
>> Suites: completed 4, aborted 0
>> Tests: succeeded 28, failed 0, canceled 0, ignored 0, pending 0
>> All tests passed.
>>
>> So this is a regression

Re: ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Nick Pentreath
Seems a straightforward change that purely enhances efficiency, so yes
please submit a JIRA and PR for this
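
(The general shape of the proposed change, as a sketch rather than the actual
Word2Vec code: build the lookup map from the broadcast once per transform call,
not once per sentence. The names below are assumptions.)

import org.apache.spark.sql.functions.udf

// bModel: Broadcast[org.apache.spark.mllib.feature.Word2VecModel] (assumed name)
val vectorMap = bModel.value.getVectors          // built once, not once per row

val knownWords = udf { sentence: Seq[String] =>
  sentence.filter(vectorMap.contains)            // every row reuses the same local map
}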

On Tue, Nov 10, 2015 at 8:56 AM, Sean Owen  wrote:

> Since it's a fairly expensive operation to build the Map, I tend to agree
> it should not happen in the loop.
>
> On Tue, Nov 10, 2015 at 5:08 AM, Yuming Wang  wrote:
>
>> Hi
>>
>>
>>
>> I found org.apache.spark.ml.feature.Word2Vec.transform() very slow.
>>
>> I think we should not read the broadcast for every sentence, so I fixed it on my fork.
>>
>>
>>
>> https://github.com/979969786/spark/commit/a9f894df3671bb8df2f342de1820dab3185598f3
>>
>>
>>
>> I have used 2 number rows to test it. The original version took *5 minutes*,
>>
>>
>>
>>
>> and my version took just *22 seconds* on the same data.
>>
>>
>>
>>
>>
>>
>>
>> If I'm right, I will open a pull request.
>>
>>
>>
>> Thanks
>>
>>
>


Re: HyperLogLogUDT

2015-09-13 Thread Nick Pentreath
Thanks Yin




So how does one ensure a UDAF works with Tungsten and UnsafeRow buffers? Or is 
this something that will be included in the UDAF interface in future? 




Is there a performance difference between Extending UDAF vs Aggregate2?




It's also not clear to me how to handle inputs of different types. What if my 
UDAF can handle String and Long, for example? Do I need to specify AnyType, or is 
there a way to specify multiple possible types for a single input column?




If there is no performance difference and UDAF can work with Tungsten, then, Herman, does 
it perhaps make sense to use UDAF (but without a UDT, as you've done for 
performance)? It would then be easy to extend that UDAF and adjust the 
output types as needed. It also provides a really nice example of how to use 
the interface for something advanced and high-performance.



—
Sent from Mailbox

On Sun, Sep 13, 2015 at 12:09 AM, Yin Huai <yh...@databricks.com> wrote:

> Hi Nick,
> The buffer exposed to UDAF interface is just a view of underlying buffer
> (this underlying buffer is shared by different aggregate functions and
> every function takes one or multiple slots). If you need a UDAF, extending
> UserDefinedAggregationFunction is the preferred
> approach. AggregateFunction2 is used for built-in aggregate function.
> Thanks,
> Yin
> On Sat, Sep 12, 2015 at 10:40 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>> Ok, that makes sense. So this is (a) more efficient, since as far as I can
>> see it is updating the HLL registers directly in the buffer for each value,
>> and (b) would be "Tungsten-compatible" as it can work against UnsafeRow? Is
>> it currently possible to specify an UnsafeRow as a buffer in a UDAF?
>>
>> So is extending AggregateFunction2 the preferred approach over the
>> UserDefinedAggregationFunction interface? Or it is that internal only?
>>
>> I see one of the main use cases for things like HLL / CMS and other
>> approximate data structure being the fact that you can store them as
>> columns representing distinct counts in an aggregation. And then do further
>> arbitrary aggregations on that data as required. e.g. store hourly
>> aggregate data, and compute daily or monthly aggregates from that, while
>> still keeping the ability to have distinct counts on certain fields.
>>
>> So exposing the serialized HLL as Array[Byte] say, so that it can be
>> further aggregated in a later DF operation, or saved to an external data
>> source, would be super useful.
>>
>>
>>
>> On Sat, Sep 12, 2015 at 6:06 PM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> I am typically all for code re-use. The reason for writing this is to
>>> prevent the indirection of a UDT and work directly against memory. A UDT
>>> will work fine at the moment because we still use
>>> GenericMutableRow/SpecificMutableRow as aggregation buffers. However if you
>>> would use an UnsafeRow as an AggregationBuffer (which is attractive when
>>> you have a lot of groups during aggregation) the use of an UDT is either
>>> impossible or it would become very slow because it would require us to
>>> deserialize/serialize a UDT on every update.
>>>
>>> As for compatibility, the implementation produces exactly the same
>>> results as the ClearSpring implementation. You could easily export the
>>> HLL++ register values to the current ClearSpring implementation and export
>>> those.
>>>
>>> Met vriendelijke groet/Kind regards,
>>>
>>> Herman van Hövell tot Westerflier
>>>
>>> QuestTec B.V.
>>> Torenwacht 98
>>> 2353 DC Leiderdorp
>>> hvanhov...@questtec.nl
>>> +599 9 521 4402
>>>
>>>
>>> 2015-09-12 11:06 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>:
>>>
>>>> I should add that surely the idea behind UDT is exactly that it can (a)
>>>> fit automatically into DFs and Tungsten and (b) that it can be used
>>>> efficiently in writing ones own UDTs and UDAFs?
>>>>
>>>>
>>>> On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Can I ask why you've done this as a custom implementation rather than
>>>>> using StreamLib, which is already implemented and widely used? It seems
>>>>> more portable to me to use a library - for example, I'd like to export the
>>>>> grouped data with raw HLLs to say Elasticsearch, and then do further
>>>>> on-demand aggregation in ES and visualization

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Ok, that makes sense. So this is (a) more efficient, since as far as I can
see it is updating the HLL registers directly in the buffer for each value,
and (b) would be "Tungsten-compatible" as it can work against UnsafeRow? Is
it currently possible to specify an UnsafeRow as a buffer in a UDAF?

So is extending AggregateFunction2 the preferred approach over the
UserDefinedAggregationFunction interface? Or it is that internal only?

I see one of the main use cases for things like HLL / CMS and other
approximate data structure being the fact that you can store them as
columns representing distinct counts in an aggregation. And then do further
arbitrary aggregations on that data as required. e.g. store hourly
aggregate data, and compute daily or monthly aggregates from that, while
still keeping the ability to have distinct counts on certain fields.

So exposing the serialized HLL as Array[Byte] say, so that it can be
further aggregated in a later DF operation, or saved to an external data
source, would be super useful.



On Sat, Sep 12, 2015 at 6:06 PM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> I am typically all for code re-use. The reason for writing this is to
> prevent the indirection of a UDT and work directly against memory. A UDT
> will work fine at the moment because we still use
> GenericMutableRow/SpecificMutableRow as aggregation buffers. However if you
> would use an UnsafeRow as an AggregationBuffer (which is attractive when
> you have a lot of groups during aggregation) the use of an UDT is either
> impossible or it would become very slow because it would require us to
> deserialize/serialize a UDT on every update.
>
> As for compatibility, the implementation produces exactly the same results
> as the ClearSpring implementation. You could easily export the HLL++
> register values to the current ClearSpring implementation and export those.
>
> Met vriendelijke groet/Kind regards,
>
> Herman van Hövell tot Westerflier
>
> QuestTec B.V.
> Torenwacht 98
> 2353 DC Leiderdorp
> hvanhov...@questtec.nl
> +599 9 521 4402
>
>
> 2015-09-12 11:06 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>:
>
>> I should add that surely the idea behind UDT is exactly that it can (a)
>> fit automatically into DFs and Tungsten and (b) that it can be used
>> efficiently in writing ones own UDTs and UDAFs?
>>
>>
>> On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> Can I ask why you've done this as a custom implementation rather than
>>> using StreamLib, which is already implemented and widely used? It seems
>>> more portable to me to use a library - for example, I'd like to export the
>>> grouped data with raw HLLs to say Elasticsearch, and then do further
>>> on-demand aggregation in ES and visualization in Kibana etc.
>>>
>>> Others may want to do something similar into Hive, Cassandra, HBase or
>>> whatever they are using. In this case they'd need to use this particular
>>> implementation from Spark which may be tricky to include in a dependency
>>> etc.
>>>
>>> If there are enhancements, does it not make sense to do a PR to
>>> StreamLib? Or does this interact in some better way with Tungsten?
>>>
>>> I am unclear on how the interop with Tungsten raw memory works - some
>>> pointers on that and where to look in the Spark code would be helpful.
>>>
>>> On Sat, Sep 12, 2015 at 10:45 AM, Herman van Hövell tot Westerflier <
>>> hvanhov...@questtec.nl> wrote:
>>>
>>>> Hello Nick,
>>>>
>>>> I have been working on a (UDT-less) implementation of HLL++. You can
>>>> find the PR here: https://github.com/apache/spark/pull/8362. This
>>>> currently implements the dense version of HLL++, which is a further
>>>> development of HLL. It returns a Long, but it shouldn't be too hard to
>>>> return a Row containing the cardinality and/or the HLL registers (the
>>>> binary data).
>>>>
>>>> I am curious what the stance is on using UDTs in the new UDAF
>>>> interface. Is this still viable? This wouldn't work with UnsafeRow for
>>>> instance. The OpenHashSetUDT for instance would be a nice building block
>>>> for CollectSet and all Distinct Aggregate operators. Are there any opinions
>>>> on this?
>>>>
>>>> Kind regards,
>>>>
>>>> Herman van Hövell tot Westerflier
>>>>
>>>> QuestTec B.V.
>>>> Torenwacht 98
>>>> 2353 DC Leiderdorp
>>>> hvanhov...@questtec.nl
>

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Inspired by this post:
http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
I've started putting together something based on the Spark 1.5 UDAF
interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141

Some questions -

1. How do I get the UDAF to accept input arguments of different type? We
can hash anything basically for HLL - Int, Long, String, Object, raw bytes
etc. Right now it seems we'd need to build a new UDAF for each input type,
which seems strange - I should be able to use one UDAF that can handle raw
input of different types, as well as handle existing HLLs that can be
merged/aggregated (e.g. for grouped data)
2. @Reynold, how would I ensure this works for Tungsten (ie against raw
bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
Where should I look for examples on how this works internally?
3. I've based this on the Sum and Avg examples for the new UDAF interface -
any suggestions or issue please advise. Is the intermediate buffer
efficient?
4. The current HyperLogLogUDT is private - so I've had to make my own one
which is a bit pointless as it's copy-pasted. Any thoughts on exposing that
type? Or I need to make the package spark.sql ...

Nick
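
(A minimal sketch of the 1.5 UDAF plumbing and how registration mirrors UDF
registration - a trivial "count non-null strings" aggregate rather than the HLL
one, just to show the shape; df and the column names are assumptions:)

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types._

class CountNonNull extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("count", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + 1L
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// registration looks just like registering a UDF
sqlContext.udf.register("count_non_null", new CountNonNull)
df.groupBy("k").agg(expr("count_non_null(v)"))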

On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin <r...@databricks.com> wrote:

> Yes - it's very interesting. However, ideally we should have a version of
> hyperloglog that can work directly against some raw bytes in memory (rather
> than java objects), in order for this to fit the Tungsten execution model
> where everything is operating directly against some memory address.
>
> On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Sure I can copy the code but my aim was more to understand:
>>
>> (A) if this is broadly interesting enough to folks to think about
>> updating / extending the existing UDAF within Spark
>> (b) how to register ones own custom UDAF - in which case it could be a
>> Spark package for example
>>
>> All examples deal with registering a UDF but nothing about UDAFs
>>
>> —
>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>
>>
>> On Wed, Jul 1, 2015 at 6:32 PM, Daniel Darabos <
>> daniel.dara...@lynxanalytics.com> wrote:
>>
>>> It's already possible to just copy the code from countApproxDistinct
>>> <https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153>
>>>  and
>>> access the HLL directly, or do anything you like.
>>>
>>> On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath <nick.pentre...@gmail.com
>>> > wrote:
>>>
>>>> Any thoughts?
>>>>
>>>> —
>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>
>>>>
>>>> On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Hey Spark devs
>>>>>
>>>>> I've been looking at DF UDFs and UDAFs. The approx distinct is using
>>>>> hyperloglog,
>>>>> but there is only an option to return the count as a Long.
>>>>>
>>>>> It can be useful to be able to return and store the actual data
>>>>> structure (ie serialized HLL). This effectively allows one to do
>>>>> aggregation / rollups over columns while still preserving the ability to
>>>>> get distinct counts.
>>>>>
>>>>> For example, one can store daily aggregates of events, grouped by
>>>>> various columns, while storing for each grouping the HLL of say unique
>>>>> users. So you can get the uniques per day directly but could also very
>>>>> easily do arbitrary aggregates (say monthly, annually) and still be able 
>>>>> to
>>>>> get a unique count for that period by merging the daily HLLS.
>>>>>
>>>>> I did this a while back as a Hive UDAF (
>>>>> https://github.com/MLnick/hive-udf) which returns a Struct field
>>>>> containing a "cardinality" field and a "binary" field containing the
>>>>> serialized HLL.
>>>>>
>>>>> I was wondering if there would be interest in something like this? I
>>>>> am not so clear on how UDTs work with regards to SerDe - so could one 
>>>>> adapt
>>>>> the HyperLogLogUDT to be a Struct with the serialized HLL as a field as
>>>>> well as count as a field? Then I assume this would automatically play
>>>>> nicely with DataFrame I/O etc. The gotcha is one needs to then call
>>>>> "approx_count_field.count" (or is there a concept of a "default field" for
>>>>> a Struct?).
>>>>>
>>>>> Also, being able to provide the bitsize parameter may be useful...
>>>>>
>>>>> The same thinking would apply potentially to other approximate (and
>>>>> mergeable) data structures like T-Digest and maybe CMS.
>>>>>
>>>>> Nick
>>>>>
>>>>
>>>>
>>>
>>
>


Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
I should add that surely the idea behind UDT is exactly that it can (a) fit
automatically into DFs and Tungsten and (b) that it can be used efficiently
in writing ones own UDTs and UDAFs?


On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Can I ask why you've done this as a custom implementation rather than
> using StreamLib, which is already implemented and widely used? It seems
> more portable to me to use a library - for example, I'd like to export the
> grouped data with raw HLLs to say Elasticsearch, and then do further
> on-demand aggregation in ES and visualization in Kibana etc.
>
> Others may want to do something similar into Hive, Cassandra, HBase or
> whatever they are using. In this case they'd need to use this particular
> implementation from Spark which may be tricky to include in a dependency
> etc.
>
> If there are enhancements, does it not make sense to do a PR to StreamLib?
> Or does this interact in some better way with Tungsten?
>
> I am unclear on how the interop with Tungsten raw memory works - some
> pointers on that and where to look in the Spark code would be helpful.
>
> On Sat, Sep 12, 2015 at 10:45 AM, Herman van Hövell tot Westerflier <
> hvanhov...@questtec.nl> wrote:
>
>> Hello Nick,
>>
>> I have been working on a (UDT-less) implementation of HLL++. You can find
>> the PR here: https://github.com/apache/spark/pull/8362. This currently
>> implements the dense version of HLL++, which is a further development of
>> HLL. It returns a Long, but it shouldn't be too hard to return a Row
>> containing the cardinality and/or the HLL registers (the binary data).
>>
>> I am curious what the stance is on using UDTs in the new UDAF interface.
>> Is this still viable? This wouldn't work with UnsafeRow for instance. The
>> OpenHashSetUDT for instance would be a nice building block for CollectSet
>> and all Distinct Aggregate operators. Are there any opinions on this?
>>
>> Kind regards,
>>
>> Herman van Hövell tot Westerflier
>>
>> QuestTec B.V.
>> Torenwacht 98
>> 2353 DC Leiderdorp
>> hvanhov...@questtec.nl
>> +599 9 521 4402
>>
>>
>> 2015-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>:
>>
>>> Inspired by this post:
>>> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
>>> I've started putting together something based on the Spark 1.5 UDAF
>>> interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
>>>
>>> Some questions -
>>>
>>> 1. How do I get the UDAF to accept input arguments of different type? We
>>> can hash anything basically for HLL - Int, Long, String, Object, raw bytes
>>> etc. Right now it seems we'd need to build a new UDAF for each input type,
>>> which seems strange - I should be able to use one UDAF that can handle raw
>>> input of different types, as well as handle existing HLLs that can be
>>> merged/aggregated (e.g. for grouped data)
>>> 2. @Reynold, how would I ensure this works for Tungsten (ie against raw
>>> bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
>>> Where should I look for examples on how this works internally?
>>> 3. I've based this on the Sum and Avg examples for the new UDAF
>>> interface - any suggestions or issue please advise. Is the intermediate
>>> buffer efficient?
>>> 4. The current HyperLogLogUDT is private - so I've had to make my own
>>> one which is a bit pointless as it's copy-pasted. Any thoughts on exposing
>>> that type? Or I need to make the package spark.sql ...
>>>
>>> Nick
>>>
>>> On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Yes - it's very interesting. However, ideally we should have a version
>>>> of hyperloglog that can work directly against some raw bytes in memory
>>>> (rather than java objects), in order for this to fit the Tungsten execution
>>>> model where everything is operating directly against some memory address.
>>>>
>>>> On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Sure I can copy the code but my aim was more to understand:
>>>>>
>>>>> (A) if this is broadly interesting enough to folks to think about
>>>>> updating / extending the existing UDAF within Spark
>>>>> (b) how to register ones own custom UDAF - in which case it could be

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
Can I ask why you've done this as a custom implementation rather than using
StreamLib, which is already implemented and widely used? It seems more
portable to me to use a library - for example, I'd like to export the
grouped data with raw HLLs to say Elasticsearch, and then do further
on-demand aggregation in ES and visualization in Kibana etc.

Others may want to do something similar into Hive, Cassandra, HBase or
whatever they are using. In this case they'd need to use this particular
implementation from Spark which may be tricky to include in a dependency
etc.

If there are enhancements, does it not make sense to do a PR to StreamLib?
Or does this interact in some better way with Tungsten?

I am unclear on how the interop with Tungsten raw memory works - some
pointers on that and where to look in the Spark code would be helpful.

On Sat, Sep 12, 2015 at 10:45 AM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Hello Nick,
>
> I have been working on a (UDT-less) implementation of HLL++. You can find
> the PR here: https://github.com/apache/spark/pull/8362. This currently
> implements the dense version of HLL++, which is a further development of
> HLL. It returns a Long, but it shouldn't be too hard to return a Row
> containing the cardinality and/or the HLL registers (the binary data).
>
> I am curious what the stance is on using UDTs in the new UDAF interface.
> Is this still viable? This wouldn't work with UnsafeRow for instance. The
> OpenHashSetUDT for instance would be a nice building block for CollectSet
> and all Distinct Aggregate operators. Are there any opinions on this?
>
> Kind regards,
>
> Herman van Hövell tot Westerflier
>
> QuestTec B.V.
> Torenwacht 98
> 2353 DC Leiderdorp
> hvanhov...@questtec.nl
> +599 9 521 4402
>
>
> 2015-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>:
>
>> Inspired by this post:
>> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
>> I've started putting together something based on the Spark 1.5 UDAF
>> interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
>>
>> Some questions -
>>
>> 1. How do I get the UDAF to accept input arguments of different type? We
>> can hash anything basically for HLL - Int, Long, String, Object, raw bytes
>> etc. Right now it seems we'd need to build a new UDAF for each input type,
>> which seems strange - I should be able to use one UDAF that can handle raw
>> input of different types, as well as handle existing HLLs that can be
>> merged/aggregated (e.g. for grouped data)
>> 2. @Reynold, how would I ensure this works for Tungsten (ie against raw
>> bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
>> Where should I look for examples on how this works internally?
>> 3. I've based this on the Sum and Avg examples for the new UDAF interface
>> - any suggestions or issue please advise. Is the intermediate buffer
>> efficient?
>> 4. The current HyperLogLogUDT is private - so I've had to make my own one
>> which is a bit pointless as it's copy-pasted. Any thoughts on exposing that
>> type? Or I need to make the package spark.sql ...
>>
>> Nick
>>
>> On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Yes - it's very interesting. However, ideally we should have a version
>>> of hyperloglog that can work directly against some raw bytes in memory
>>> (rather than java objects), in order for this to fit the Tungsten execution
>>> model where everything is operating directly against some memory address.
>>>
>>> On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> Sure I can copy the code but my aim was more to understand:
>>>>
>>>> (A) if this is broadly interesting enough to folks to think about
>>>> updating / extending the existing UDAF within Spark
>>>> (b) how to register ones own custom UDAF - in which case it could be a
>>>> Spark package for example
>>>>
>>>> All examples deal with registering a UDF but nothing about UDAFs
>>>>
>>>> —
>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>
>>>>
>>>> On Wed, Jul 1, 2015 at 6:32 PM, Daniel Darabos <
>>>> daniel.dara...@lynxanalytics.com> wrote:
>>>>
>>>>> It's already possible to just copy the code from countApproxDistinct
>>>>> <https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153&

Re: HyperLogLogUDT

2015-07-02 Thread Nick Pentreath
Sure I can copy the code but my aim was more to understand:




(A) if this is broadly interesting enough to folks to think about updating / 
extending the existing UDAF within Spark

(b) how to register ones own custom UDAF - in which case it could be a Spark 
package for example 




All examples deal with registering a UDF but nothing about UDAFs



—
Sent from Mailbox

On Wed, Jul 1, 2015 at 6:32 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:

 It's already possible to just copy the code from countApproxDistinct
 https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153
 and
 access the HLL directly, or do anything you like.
 On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
 Any thoughts?

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath nick.pentre...@gmail.com
  wrote:

 Hey Spark devs

 I've been looking at DF UDFs and UDAFs. The approx distinct is using
 hyperloglog,
 but there is only an option to return the count as a Long.

 It can be useful to be able to return and store the actual data structure
 (ie serialized HLL). This effectively allows one to do aggregation /
 rollups over columns while still preserving the ability to get distinct
 counts.

 For example, one can store daily aggregates of events, grouped by various
 columns, while storing for each grouping the HLL of say unique users. So
 you can get the uniques per day directly but could also very easily do
 arbitrary aggregates (say monthly, annually) and still be able to get a
 unique count for that period by merging the daily HLLS.

 I did this a while back as a Hive UDAF (
 https://github.com/MLnick/hive-udf) which returns a Struct field
 containing a cardinality field and a binary field containing the
 serialized HLL.

 I was wondering if there would be interest in something like this? I am
 not so clear on how UDTs work with regards to SerDe - so could one adapt
 the HyperLogLogUDT to be a Struct with the serialized HLL as a field as
 well as count as a field? Then I assume this would automatically play
 nicely with DataFrame I/O etc. The gotcha is one needs to then call
 approx_count_field.count (or is there a concept of a default field for
 a Struct?).

 Also, being able to provide the bitsize parameter may be useful...

 The same thinking would apply potentially to other approximate (and
 mergeable) data structures like T-Digest and maybe CMS.

 Nick




Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
Any thoughts?



—
Sent from Mailbox

On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Hey Spark devs
 I've been looking at DF UDFs and UDAFs. The approx distinct is using
 hyperloglog,
 but there is only an option to return the count as a Long.
 It can be useful to be able to return and store the actual data structure
 (ie serialized HLL). This effectively allows one to do aggregation /
 rollups over columns while still preserving the ability to get distinct
 counts.
 For example, one can store daily aggregates of events, grouped by various
 columns, while storing for each grouping the HLL of say unique users. So
 you can get the uniques per day directly but could also very easily do
 arbitrary aggregates (say monthly, annually) and still be able to get a
 unique count for that period by merging the daily HLLS.
 I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf)
 which returns a Struct field containing a cardinality field and a
 binary field containing the serialized HLL.
 I was wondering if there would be interest in something like this? I am not
 so clear on how UDTs work with regards to SerDe - so could one adapt the
 HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as
 count as a field? Then I assume this would automatically play nicely with
 DataFrame I/O etc. The gotcha is one needs to then call
 approx_count_field.count (or is there a concept of a default field for
 a Struct?).
 Also, being able to provide the bitsize parameter may be useful...
 The same thinking would apply potentially to other approximate (and
 mergeable) data structures like T-Digest and maybe CMS.
 Nick

HyperLogLogUDT

2015-06-23 Thread Nick Pentreath
Hey Spark devs

I've been looking at DF UDFs and UDAFs. The approx distinct is using
hyperloglog,
but there is only an option to return the count as a Long.

It can be useful to be able to return and store the actual data structure
(ie serialized HLL). This effectively allows one to do aggregation /
rollups over columns while still preserving the ability to get distinct
counts.

For example, one can store daily aggregates of events, grouped by various
columns, while storing for each grouping the HLL of say unique users. So
you can get the uniques per day directly but could also very easily do
arbitrary aggregates (say monthly, annually) and still be able to get a
unique count for that period by merging the daily HLLS.

I did this a while back as a Hive UDAF (https://github.com/MLnick/hive-udf)
which returns a Struct field containing a cardinality field and a
binary field containing the serialized HLL.

I was wondering if there would be interest in something like this? I am not
so clear on how UDTs work with regards to SerDe - so could one adapt the
HyperLogLogUDT to be a Struct with the serialized HLL as a field as well as
count as a field? Then I assume this would automatically play nicely with
DataFrame I/O etc. The gotcha is one needs to then call
approx_count_field.count (or is there a concept of a default field for
a Struct?).

Also, being able to provide the bitsize parameter may be useful...

The same thinking would apply potentially to other approximate (and
mergeable) data structures like T-Digest and maybe CMS.

Nick
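
A minimal sketch of the merge-then-count idea above: merging two HLLs is an
element-wise max over their registers, so any coarser rollup (daily to monthly)
is just a reduce over the stored sketches, with no re-scan of the raw events.
The Sketch class below is a toy stand-in, not the Hive UDAF or a Spark UDT:

  // Toy mergeable sketch; a real HLL also derives a cardinality estimate
  // from the registers, which is omitted here.
  case class Sketch(registers: Array[Byte]) {
    def merge(other: Sketch): Sketch =
      Sketch(registers.zip(other.registers).map { case (a, b) => math.max(a, b).toByte })
  }

  val day1 = Sketch(Array[Byte](1, 4, 0, 2))
  val day2 = Sketch(Array[Byte](3, 1, 2, 2))
  val monthly = Seq(day1, day2).reduce(_ merge _)  // registers (3, 4, 2, 2)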


Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-18 Thread Nick Pentreath
If it's going into the DataFrame API (which it probably should rather than
in RDD itself) - then it could become a UDT (similar to HyperLogLogUDT)
which would mean it doesn't have to implement Serializable, as it appears
that serialization is taken care of in the UDT def (e.g.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254
)

If I understand UDT SerDe correctly?

On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas rorti...@linkedin.com.invalid
wrote:

 Hi Grega and Reynold,

 Grega, if you still want to use t-digest, I filed this PR because I
 thought your t-digest suggestion was a good idea.

 https://github.com/tdunning/t-digest/pull/56

 If it is helpful feel free to do whatever with it.

 Regards,
 Ray


 On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin r...@databricks.com wrote:

 This email is good. Just one note -- a lot of people are swamped right
 before Spark Summit, so you might not get prompt responses this week.


 On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret gr...@celtra.com wrote:

 I have some time to work on it now. What's a good way to continue the
 discussions before coding it?

 This e-mail list, JIRA or something else?

 On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin r...@databricks.com
 wrote:

 I think those are great to have. I would put them in the DataFrame API
 though, since this is applying to structured data. Many of the advanced
 functions on the PairRDDFunctions should really go into the DataFrame API
 now we have it.

 One thing that would be great to understand is what state-of-the-art
 alternatives are out there. I did a quick google scholar search using the
 keyword approximate quantile and found some older papers. Just the
 first few I found:

 http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs


 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
  by Bruce Lindsay, IBM

 http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf





 On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret gr...@celtra.com
 wrote:

 Hi!

 I'd like to get community's opinion on implementing a generic quantile
 approximation algorithm for Spark that is O(n) and requires limited 
 memory.
 I would find it useful and I haven't found any existing implementation. 
 The
 plan was basically to wrap t-digest
 https://github.com/tdunning/t-digest, implement the
 serialization/deserialization boilerplate and provide

 def cdf(x: Double): Double
 def quantile(q: Double): Double


 on RDD[Double] and RDD[(K, Double)].

 Let me know what you think. Any other ideas/suggestions also welcome!

 Best,
 Grega
 --
 [image: Inline image 1]*Grega Kešpret*
 Senior Software Engineer, Analytics

 Skype: gregakespret
 celtra.com http://www.celtra.com/ | @celtramobile
 http://www.twitter.com/celtramobile
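
A rough sketch of the per-partition digest-and-merge pattern proposed in this
thread. The Digest trait is a hypothetical stand-in for a mergeable structure
such as t-digest (the real library API is not shown here):

  import org.apache.spark.rdd.RDD

  trait Digest extends Serializable {
    def add(x: Double): Digest
    def merge(other: Digest): Digest
    def quantile(q: Double): Double
  }

  // Each partition folds its values into a small digest; the partial digests
  // are then merged, giving a single pass with bounded memory per partition.
  def approxQuantile(data: RDD[Double], empty: Digest, q: Double): Double =
    data.aggregate(empty)((d, x) => d.add(x), (d1, d2) => d1.merge(d2)).quantile(q)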








Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-10 Thread Nick Pentreath
Looks very interesting, thanks for sharing this.

I haven't had much chance to do more than a quick glance over the code.
Quick question - are the Word2Vec and GLOVE implementations fully parallel
on Spark?

On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright ewri...@live.com wrote:


 The deeplearning4j framework provides a variety of distributed, neural
 network-based learning algorithms, including convolutional nets, deep
 auto-encoders, deep-belief nets, and recurrent nets.  We’re working on
 integration with the Spark ML pipeline, leveraging the developer API.
 This announcement is to share some code and get feedback from the Spark
 community.

 The integration code is located in the dl4j-spark-ml module
 https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml
  in
 the deeplearning4j repository.

 Major aspects of the integration work:

1. *ML algorithms.*  To bind the dl4j algorithms to the ML pipeline,
we developed a new classifier

 https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala
  and
a new unsupervised learning estimator

 https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/Unsupervised.scala.

2. *ML attributes.* We strove to interoperate well with other pipeline
components.   ML Attributes are column-level metadata enabling information
sharing between pipeline components.See here

 https://github.com/deeplearning4j/deeplearning4j/blob/4d33302dd8a792906050eda82a7d50ff77a8d957/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/ml/classification/MultiLayerNetworkClassification.scala#L89
  how
the classifier reads label metadata from a column provided by the new
StringIndexer

 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer
.
3. *Large binary data.*  It is challenging to work with large binary
data in Spark.   An effective approach is to leverage PrunedScan and to
carefully control partition sizes.  Here

 https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala
  we
explored this with a custom data source based on the new relation API.
4. *Column-based record readers.*  Here

 https://github.com/deeplearning4j/deeplearning4j/blob/b237385b56d42d24bd3c99d1eece6cb658f387f2/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/sources/lfw/LfwRelation.scala#L96
  we
explored how to construct rows from a Hadoop input split by composing a
number of column-level readers, with pruning support.
5. *UDTs*.   With Spark SQL it is possible to introduce new data
types.   We prototyped an experimental Tensor type, here

 https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/main/scala/org/deeplearning4j/spark/sql/types/tensors.scala
.
6. *Spark Package.*   We developed a spark package to make it easy to
use the dl4j framework in spark-shell and with spark-submit.  See the
deeplearning4j/dl4j-spark-ml
https://github.com/deeplearning4j/dl4j-spark-ml repository for
useful snippets involving the sbt-spark-package plugin.
7. *Example code.*   Examples demonstrate how the standardized ML API
simplifies interoperability, such as with label preprocessing and feature
scaling.   See the deeplearning4j/dl4j-spark-ml-examples
https://github.com/deeplearning4j/dl4j-spark-ml-examples repository
for an expanding set of example pipelines.

 Hope this proves useful to the community as we transition to exciting new
 concepts in Spark SQL and Spark ML.   Meanwhile, we have Spark working
 with multiple GPUs on AWS http://deeplearning4j.org/gpu_aws.html and
 we're looking forward to optimizations that will speed neural net training
 even more.

 Eron Wright
 Contributor | deeplearning4j.org




Re: [discuss] ending support for Java 6?

2015-05-01 Thread Nick Pentreath
+1 for this think it's high time.




We should of course do it with enough warning for users. 1.4 May be too early 
(not for me though!). Perhaps we specify that 1.5 will officially move to JDK7?









—
Sent from Mailbox

On Fri, May 1, 2015 at 12:16 AM, Ram Sriharsha
harsh...@yahoo-inc.com.invalid wrote:

 +1 for end of support for Java 6 
  On Thursday, April 30, 2015 3:08 PM, Vinod Kumar Vavilapalli 
 vino...@hortonworks.com wrote:

  FYI, after enough consideration, we the Hadoop community dropped support for 
 JDK 6 starting release Apache Hadoop 2.7.x.
 Thanks
 +Vinod
 On Apr 30, 2015, at 12:02 PM, Reynold Xin r...@databricks.com wrote:
 This has been discussed a few times in the past, but now Oracle has ended
 support for Java 6 for over a year, I wonder if we should just drop Java 6
 support.
 
 There is one outstanding issue Tom has brought to my attention: PySpark on
 YARN doesn't work well with Java 7/8, but we have an outstanding pull
 request to fix that.
 
 https://issues.apache.org/jira/browse/SPARK-6869
 https://issues.apache.org/jira/browse/SPARK-1920
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point to read multiple files together in a partition, is it
not simpler to use the approach of copy Hadoop conf and set per-RDD
settings for min split to control the input size per partition, together
with something like CombineFileInputFormat?

On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid iras...@cloudera.com wrote:

 I think this would be a great addition, I totally agree that you need to be
 able to set these at a finer context than just the SparkContext.

 Just to play devil's advocate, though -- the alternative is for you just
 subclass HadoopRDD yourself, or make a totally new RDD, and then you could
 expose whatever you need.  Why is this solution better?  IMO the criteria
 are:
 (a) common operations
 (b) error-prone / difficult to implement
 (c) non-obvious, but important for performance

 I think this case fits (a)  (c), so I think its still worthwhile.  But its
 also worth asking whether or not its too difficult for a user to extend
 HadoopRDD right now.  There have been several cases in the past week where
 we've suggested that a user should read from hdfs themselves (eg., to read
 multiple files together in one partition) -- with*out* reusing the code in
 HadoopRDD, though they would lose things like the metric tracking 
 preferred locations you get from HadoopRDD.  Does HadoopRDD need to some
 refactoring to make that easier to do?  Or do we just need a good example?

 Imran

 (sorry for hijacking your thread, Koert)



 On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers ko...@tresata.com wrote:

  see email below. reynold suggested i send it to dev instead of user
 
  -- Forwarded message --
  From: Koert Kuipers ko...@tresata.com
  Date: Mon, Mar 23, 2015 at 4:36 PM
  Subject: hadoop input/output format advanced control
  To: u...@spark.apache.org u...@spark.apache.org
 
 
  currently its pretty hard to control the Hadoop Input/Output formats used
  in Spark. The conventions seems to be to add extra parameters to all
  methods and then somewhere deep inside the code (for example in
  PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
 into
  settings on the Hadoop Configuration object.
 
 for example for compression i see codec: Option[Class[_ <:
 CompressionCodec]] = None added to a bunch of methods.
 
  how scalable is this solution really?
 
  for example i need to read from a hadoop dataset and i dont want the
 input
  (part) files to get split up. the way to do this is to set
  mapred.min.split.size. now i dont want to set this at the level of the
  SparkContext (which can be done), since i dont want it to apply to input
  formats in general. i want it to apply to just this one specific input
  dataset i need to read. which leaves me with no options currently. i
 could
  go add yet another input parameter to all the methods
  (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
  etc.). but that seems ineffective.
 
  why can we not expose a Map[String, String] or some other generic way to
  manipulate settings for hadoop input/output formats? it would require
  adding one more parameter to all methods to deal with hadoop input/output
  formats, but after that its done. one parameter to rule them all
 
  then i could do:
  val x = sc.textFile("/some/path", formatSettings =
  Map("mapred.min.split.size" -> "12345"))
 
  or
  rdd.saveAsTextFile("/some/path", formatSettings =
  Map("mapred.output.compress" -> "true",
  "mapred.output.compression.codec" -> "somecodec"))
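
For reference, the "copy the Hadoop conf and set it per input" approach
mentioned at the top of this thread looks roughly like this (assuming sc is
the SparkContext; the property name and split size are illustrative):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // Copy the context-wide Hadoop conf and override it for this one input only.
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("mapreduce.input.fileinputformat.split.minsize", (256L * 1024 * 1024).toString)

  val lines = sc.newAPIHadoopFile(
    "/some/path",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf
  ).map { case (_, text) => text.toString }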
 



Re: Directly broadcasting (sort of) RDDs

2015-03-21 Thread Nick Pentreath
There is block matrix in Spark 1.3 - 
http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix





However I believe it only supports dense matrix blocks.




Still, might be possible to use it or extend 




JIRAs:


https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3434





Was based on 


https://github.com/amplab/ml-matrix





Another lib:


https://github.com/PasaLab/marlin/blob/master/README.md







—
Sent from Mailbox

On Sat, Mar 21, 2015 at 12:24 AM, Guillaume Pitel
guillaume.pi...@exensa.com wrote:

 Hi,
 I have an idea that I would like to discuss with the Spark devs. The 
 idea comes from a very real problem that I have struggled with since 
 almost a year. My problem is very simple, it's a dense matrix * sparse 
 matrix  operation. I have a dense matrix RDD[(Int,FloatMatrix)] which is 
 divided in X large blocks (one block per partition), and a sparse matrix 
 RDD[((Int,Int),Array[Array[(Int,Float)]])], divided in X * Y blocks. The 
 most efficient way to perform the operation is to collectAsMap() the 
 dense matrix and broadcast it, then perform the block-local 
 multiplications, and combine the results by column.
 This is quite fine, unless the matrix is too big to fit in memory 
 (especially since the multiplication is performed several times 
 iteratively, and the broadcasts are not always cleaned from memory as I 
 would naively expect).
 When the dense matrix is too big, a second solution is to split the big 
 sparse matrix in several RDD, and do several broadcasts. Doing this 
 creates quite a big overhead, but it mostly works, even though I often 
 face some problems with inaccessible broadcast files, for instance.
 Then there is the terrible but apparently very effective good old join. 
 Since X blocks of the sparse matrix use the same block from the dense 
 matrix, I suspect that the dense matrix is somehow replicated X times 
 (either on disk or in the network), which is the reason why the join 
 takes so much time.
 After this bit of a context, here is my idea : would it be possible to 
 somehow broadcast (or maybe more accurately, share or serve) a 
 persisted RDD which is distributed on all workers, in a way that would, 
 a bit like the IndexedRDD, allow a task to access a partition or an 
 element of a partition in the closure, with a worker-local memory cache 
 . i.e. the information about where each block resides would be 
 distributed on the workers, to allow them to access parts of the RDD 
 directly. I think that's already a bit how RDD are shuffled ?
 The RDD could stay distributed (no need to collect then broadcast), and 
 only necessary transfers would be required.
 Is this a bad idea, is it already implemented somewhere (I would love it 
 !) ?or is it something that could add efficiency not only for my use 
 case, but maybe for others ? Could someone give me some hint about how I 
 could add this possibility to Spark ? I would probably try to extend a 
 RDD into a specific SharedIndexedRDD with a special lookup that would be 
 allowed from tasks as a special case, and that would try to contact the 
 blockManager and reach the corresponding data from the right worker.
 Thanks in advance for your advices
 Guillaume
 -- 
 eXenSa
   
 *Guillaume PITEL, Président*
 +33(0)626 222 431
 eXenSa S.A.S. http://www.exensa.com/
 41, rue Périer - 92120 Montrouge - FRANCE
 Tel +33(0)184 163 677 / Fax +33(0)972 283 705

Re: Welcoming three new committers

2015-02-04 Thread Nick Pentreath
Congrats and welcome Sean, Joseph and Cheng!


On Wed, Feb 4, 2015 at 2:10 PM, Sean Owen so...@cloudera.com wrote:

 Thanks all, I appreciate the vote of trust. I'll do my best to help
 keep JIRA and commits moving along, and am ramping up carefully this
 week. Now get back to work reviewing things!

 On Tue, Feb 3, 2015 at 4:34 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Hi all,
 
  The PMC recently voted to add three new committers: Cheng Lian, Joseph
 Bradley and Sean Owen. All three have been major contributors to Spark in
 the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many
 pieces throughout Spark Core. Join me in welcoming them as committers!
 
  Matei

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
Absolutely; as I mentioned by all means submit a PR - I just wanted to point 
out that any specific converter is not officially supported, although the 
interface is of course.


I'm happy to review a PR just ping me when ready.


—
Sent from Mailbox

On Mon, Jan 5, 2015 at 7:06 PM, Ted Yu yuzhih...@gmail.com wrote:

 HBaseConverter is in Spark source tree. Therefore I think it makes sense
 for this improvement to be accepted so that the example is more useful.
 Cheers
 On Mon, Jan 5, 2015 at 7:54 AM, Nick Pentreath nick.pentre...@gmail.com
 wrote:
 Hey

 These converters are actually just intended to be examples of how to set
 up a custom converter for a specific input format. The converter interface
 is there to provide flexibility where needed, although with the new
 SparkSQL data store interface the intention is that most common use cases
 can be handled using that approach rather than custom converters.

 The intention is not to have specific converters living in Spark core,
 which is why these are in the examples project.

 Having said that, if you wish to expand the example converter for others
 reference do feel free to submit a PR.

 Ideally though, I would think that various custom converters would be part
 of external projects that can be listed with http://spark-packages.org/ I
 see your project is already listed there.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Mon, Jan 5, 2015 at 5:37 PM, Ted Yu yuzhih...@gmail.com wrote:

 In my opinion this would be useful - there was another thread where
 returning
 only the value of first column in the result was mentioned.

 Please create a SPARK JIRA and a pull request.

 Cheers

 On Mon, Jan 5, 2015 at 6:42 AM, tgbaggio gen.tan...@gmail.com wrote:

  Hi,
 
  In HBaseConverter.scala
  
 
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
  
  , the python converter HBaseResultToStringConverter returns only the
 value
  of
  first column in the result. In my opinion, it limits the utility of
 this
  converter, because it returns only one value per row and moreover it
 loses
  the other information of record, such as column:cell, timestamp.
 
  Therefore, I would like to propose some modifications about
  HBaseResultToStringConverter which will be able to return all records
 in
  the
  hbase with more complete information: I have already written some code
 in
  pythonConverters.scala
  
 
 https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala
  
  and it works
 
  Is it OK to modify the code in HBaseConverters.scala, please?
  Thanks a lot in advance.
 
  Cheers
  Gen
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-tp10001.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 




Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
Hey 


These converters are actually just intended to be examples of how to set up a 
custom converter for a specific input format. The converter interface is there 
to provide flexibility where needed, although with the new SparkSQL data store 
interface the intention is that most common use cases can be handled using that 
approach rather than custom converters.




The intention is not to have specific converters living in Spark core, which is 
why these are in the examples project.




Having said that, if you wish to expand the example converter for others 
reference do feel free to submit a PR.




Ideally though, I would think that various custom converters would be part of 
external projects that can be listed with http://spark-packages.org/ I see your 
project is already listed there.


—
Sent from Mailbox

On Mon, Jan 5, 2015 at 5:37 PM, Ted Yu yuzhih...@gmail.com wrote:

 In my opinion this would be useful - there was another thread where returning
 only the value of first column in the result was mentioned.
 Please create a SPARK JIRA and a pull request.
 Cheers
 On Mon, Jan 5, 2015 at 6:42 AM, tgbaggio gen.tan...@gmail.com wrote:
 Hi,

 In  HBaseConverter.scala
 
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
 
 , the python converter HBaseResultToStringConverter returns only the value
 of
 first column in the result. In my opinion, it limits the utility of this
 converter, because it returns only one value per row and moreover it loses
 the other information of record, such as column:cell, timestamp.

 Therefore, I would like to propose some modifications about
 HBaseResultToStringConverter which will be able to return all records in
 the
 hbase with more complete information: I have already written some code in
 pythonConverters.scala
 
 https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala
 
 and it works

 Is it OK to modify the code in HBaseConverters.scala, please?
 Thanks a lot in advance.

 Cheers
 Gen




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/python-converter-in-HBaseConverter-scala-spark-examples-tp10001.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



Re: Highly interested in contributing to spark

2015-01-01 Thread Nick Pentreath
Oh actually I was confused with another project, yours was not LSH sorry!






—
Sent from Mailbox

On Fri, Jan 2, 2015 at 8:19 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 I'm sure Spark will sign up for GSoC again this year - and I'd be surprised if 
 there was not some interest now for projects :)
 If I have the time at that point in the year I'd be happy to mentor a project 
 in MLlib but will have to see how my schedule is at that point!
 Manoj perhaps some of the locality sensitive hashing stuff you did for 
 scikit-learn could find its way to Spark or spark-projects.
 —
 Sent from Mailbox
 On Fri, Jan 2, 2015 at 6:28 AM, Reynold Xin r...@databricks.com wrote:
 Hi Manoj,
 Thanks for the email.
 Yes - you should start with the starter task before attempting larger ones.
 Last year I signed up as a mentor for GSoC, but no student signed up. I
 don't think I'd have time to be a mentor this year, but others might.
 On Thu, Jan 1, 2015 at 4:54 PM, Manoj Kumar manojkumarsivaraj...@gmail.com
 wrote:
 Hello,

 I am Manoj (https://github.com/MechCoder), an undergraduate student highly
 interested in Machine Learning. I have contributed to SymPy and
 scikit-learn as part of Google Summer of Code projects and my bachelor's
 thesis. I have a few quick (non-technical) questions before I dive into the
 issue tracker.

 Are the ones marked trivial easy to fix ones, that I could try before
 attempting slightly more ambitious ones? Also I would like to know if
 Apache Spark takes part in Google Summer of Code projects under the Apache
 Software Foundation. It would be really great if it does!

 Looking forward!

 --
 Godspeed,
 Manoj Kumar,
 Mech Undergrad
 http://manojbits.wordpress.com


Re: Highly interested in contributing to spark

2015-01-01 Thread Nick Pentreath
I'm sure Spark will sign up for GSoC again this year - and I'd be surprised if 
there was not some interest now for projects :)


If I have the time at that point in the year I'd be happy to mentor a project 
in MLlib but will have to see how my schedule is at that point!




Manoj perhaps some of the locality sensitive hashing stuff you did for 
scikit-learn could find its way to Spark or spark-projects.


—
Sent from Mailbox

On Fri, Jan 2, 2015 at 6:28 AM, Reynold Xin r...@databricks.com wrote:

 Hi Manoj,
 Thanks for the email.
 Yes - you should start with the starter task before attempting larger ones.
 Last year I signed up as a mentor for GSoC, but no student signed up. I
 don't think I'd have time to be a mentor this year, but others might.
 On Thu, Jan 1, 2015 at 4:54 PM, Manoj Kumar manojkumarsivaraj...@gmail.com
 wrote:
 Hello,

 I am Manoj (https://github.com/MechCoder), an undergraduate student highly
 interested in Machine Learning. I have contributed to SymPy and
 scikit-learn as part of Google Summer of Code projects and my bachelor's
 thesis. I have a few quick (non-technical) questions before I dive into the
 issue tracker.

 Are the ones marked trivial easy to fix ones, that I could try before
 attempting slightly more ambitious ones? Also I would like to know if
 Apache Spark takes part in Google Summer of Code projects under the Apache
 Software Foundation. It would be really great if it does!

 Looking forward!

 --
 Godspeed,
 Manoj Kumar,
 Mech Undergrad
 http://manojbits.wordpress.com


Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread Nick Pentreath
+1

—
Sent from Mailbox

On Sat, Dec 13, 2014 at 3:12 PM, GuoQiang Li wi...@qq.com wrote:

 +1 (non-binding).  Tested on CentOS 6.4
 -- Original --
 From:  Patrick Wendell;pwend...@gmail.com;
 Date:  Thu, Dec 11, 2014 05:08 AM
 To:  dev@spark.apache.org;
 Subject:  [VOTE] Release Apache Spark 1.2.0 (RC2)
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.0!
 The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc2/
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1055/
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
 Please vote on releasing this package as Apache Spark 1.2.0!
 The vote is open until Saturday, December 13, at 21:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 [ ] +1 Release this package as Apache Spark 1.2.0
 [ ] -1 Do not release this package because ...
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 == What justifies a -1 vote for this release? ==
 This vote is happening relatively late into the QA period, so
 -1 votes should only occur for significant regressions from
 1.0.2. Bugs already present in 1.1.X, minor
 regressions, or bugs related to new features will not block this
 release.
 == What default changes should I be aware of? ==
 1. The default value of spark.shuffle.blockTransferService has been
 changed to netty
 -- Old behavior can be restored by switching to nio
 2. The default value of spark.shuffle.manager has been changed to sort.
 -- Old behavior can be restored by setting spark.shuffle.manager to hash.
 == How does this differ from RC1 ==
 This has fixes for a handful of issues identified - some of the
 notable fixes are:
 [Core]
 SPARK-4498: Standalone Master can fail to recognize completed/failed
 applications
 [SQL]
 SPARK-4552: Query for empty parquet table in spark sql hive get
 IllegalArgumentException
 SPARK-4753: Parquet2 does not prune based on OR filters on partition columns
 SPARK-4761: With JDBC server, set Kryo as default serializer and
 disable reference tracking
 SPARK-4785: When called with arguments referring column fields, PMOD throws 
 NPE
 - Patrick
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Nick Pentreath
+1 (binding)

—
Sent from Mailbox

On Thu, Nov 6, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com
wrote:

 +1
 The app to track PRs based on component is a great idea...
 On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com
 wrote:
 +1

 Sean

 On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

  Hi all,
 
  I wanted to share a discussion we've been having on the PMC list, as
 well as call for an official vote on it on a public list. Basically, as the
 Spark project scales up, we need to define a model to make sure there is
 still great oversight of key components (in particular internal
 architecture and public APIs), and to this end I've proposed implementing a
 maintainer model for some of these components, similar to other large
 projects.
 
  As background on this, Spark has grown a lot since joining Apache. We've
 had over 80 contributors/month for the past 3 months, which I believe makes
 us the most active project in contributors/month at Apache, as well as over
 500 patches/month. The codebase has also grown significantly, with new
 libraries for SQL, ML, graphs and more.
 
  In this kind of large project, one common way to scale development is to
 assign maintainers to oversee key components, where each patch to that
 component needs to get sign-off from at least one of its maintainers. Most
 existing large projects do this -- at Apache, some large ones with this
 model are CloudStack (the second-most active project overall), Subversion,
 and Kafka, and other examples include Linux and Python. This is also
 by-and-large how Spark operates today -- most components have a de-facto
 maintainer.
 
  IMO, adopting this model would have two benefits:
 
  1) Consistent oversight of design for that component, especially
 regarding architecture and API. This process would ensure that the
 component's maintainers see all proposed changes and consider them to fit
 together in a good way.
 
  2) More structure for new contributors and committers -- in particular,
 it would be easy to look up who’s responsible for each module and ask them
 for reviews, etc, rather than having patches slip between the cracks.
 
  We'd like to start with in a light-weight manner, where the model only
 applies to certain key components (e.g. scheduler, shuffle) and user-facing
 APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
 it if we deem it useful. The specific mechanics would be as follows:
 
  - Some components in Spark will have maintainers assigned to them, where
 one of the maintainers needs to sign off on each patch to the component.
  - Each component with maintainers will have at least 2 maintainers.
  - Maintainers will be assigned from the most active and knowledgeable
 committers on that component by the PMC. The PMC can vote to add / remove
 maintainers, and maintained components, through consensus.
  - Maintainers are expected to be active in responding to patches for
 their components, though they do not need to be the main reviewers for them
 (e.g. they might just sign off on architecture / API). To prevent inactive
 maintainers from blocking the project, if a maintainer isn't responding in
 a reasonable time period (say 2 weeks), other committers can merge the
 patch, and the PMC will want to discuss adding another maintainer.
 
  If you'd like to see examples for this model, check out the following
 projects:
  - CloudStack:
 https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
 
 https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
 
  - Subversion:
 https://subversion.apache.org/docs/community-guide/roles.html 
 https://subversion.apache.org/docs/community-guide/roles.html
 
  Finally, I wanted to list our current proposal for initial components
 and maintainers. It would be good to get feedback on other components we
 might add, but please note that personnel discussions (e.g. I don't think
 Matei should maintain *that* component) should only happen on the private
 list. The initial components were chosen to include all public APIs and the
 main core components, and the maintainers were chosen from the most active
 contributors to those modules.
 
  - Spark core public API: Matei, Patrick, Reynold
  - Job scheduler: Matei, Kay, Patrick
  - Shuffle and network: Reynold, Aaron, Matei
  - Block manager: Reynold, Aaron
  - YARN: Tom, Andrew Or
  - Python: Josh, Matei
  - MLlib: Xiangrui, Matei
  - SQL: Michael, Reynold
  - Streaming: TD, Matei
  - GraphX: Ankur, Joey, Reynold
 
  I'd like to formally call a [VOTE] on this model, to last 72 hours. The
 [VOTE] will end on Nov 8, 2014 at 6 PM PST.
 
  Matei


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



Re: matrix factorization cross validation

2014-10-30 Thread Nick Pentreath
Looking at
https://github.com/apache/spark/blob/814a9cd7fabebf2a06f7e2e5d46b6a2b28b917c2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L82

For each user in test set, you generate an Array of top K predicted item
ids (Int or String probably), and an Array of ground truth item ids (the
known rated or liked items in the test set for that user), and pass that to
precisionAt(k) to compute MAP@k (actually this method name is a bit
misleading - it should be meanAveragePrecisionAt, since the other method
there computes MAP without a cutoff at k; both compute MAP).
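
A small usage sketch (the ids are illustrative, and sc is the SparkContext;
perUser would normally come from joining the model's top-K predictions with
the held-out ratings):

  import org.apache.spark.mllib.evaluation.RankingMetrics

  // One (predicted top-K item ids, ground-truth item ids) pair per user.
  val perUser = sc.parallelize(Seq(
    (Array(1, 2, 3, 4, 5), Array(2, 5, 9)),
    (Array(6, 7, 8, 9, 10), Array(7, 11))
  ))
  val metrics = new RankingMetrics(perUser)
  println(metrics.precisionAt(5))
  println(metrics.meanAveragePrecision)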

The challenge at scale is actually computing all the top Ks for each user,
as it requires broadcasting all the item factors (unless there is a smarter
way?)

I wonder if it is possible to extend the DIMSUM idea to computing top K
matrix multiply between the user and item factor matrices, as opposed to
all-pairs similarity of one matrix?
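
A brute-force sketch of that broadcast approach (names and factor layout are
illustrative; it only works while the collected item factors fit in memory):

  import org.apache.spark.rdd.RDD

  def topKPerUser(userFactors: RDD[(Int, Array[Double])],
                  itemFactors: Array[(Int, Array[Double])],
                  k: Int): RDD[(Int, Array[Int])] = {
    val bcItems = userFactors.sparkContext.broadcast(itemFactors)
    userFactors.map { case (user, uf) =>
      // Score every item for this user and keep the ids of the top K.
      val scored = bcItems.value.map { case (itemId, f) =>
        (itemId, f.zip(uf).map { case (a, b) => a * b }.sum)
      }
      (user, scored.sortBy(-_._2).take(k).map(_._1))
    }
  }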

On Thu, Oct 30, 2014 at 5:28 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Is there an example of how to use RankingMetrics ?

 Let's take the user, document example...we get user x topic and document x
 topic matrices as the model...

 Now for each user, we can generate topK document by doing a sort on (1 x
 topic)dot(topic x document) and picking topK...

 Is it possible to validate such a topK finding algorithm using
 RankingMetrics ?


 On Wed, Oct 29, 2014 at 12:14 PM, Xiangrui Meng men...@gmail.com wrote:

  Let's narrow the context from matrix factorization to recommendation
  via ALS. It adds extra complexity if we treat it as a multi-class
  classification problem. ALS only outputs a single value for each
  prediction, which is hard to convert to probability distribution over
  the 5 rating levels. Treating it as a binary classification problem or
  a ranking problem does make sense. The RankingMetricc is in master.
  Free free to add prec@k and ndcg@k to examples.MovielensALS. ROC
  should be good to add as well. -Xiangrui
 
 
  On Wed, Oct 29, 2014 at 11:23 AM, Debasish Das debasish.da...@gmail.com
 
  wrote:
   Hi,
  
   In the current factorization flow, we cross validate on the test
 dataset
   using the RMSE number but there are some other measures which are worth
   looking into.
  
   If we consider the problem as a regression problem and the ratings 1-5
  are
   considered as 5 classes, it is possible to generate a confusion matrix
   using MultiClassMetrics.scala
  
   If the ratings are only 0/1 (like from the spotify demo from spark
  summit)
   then it is possible to use Binary Classification Metrices to come up
 with
   the ROC curve...
  
   For topK user/products we should also look into prec@k and pdcg@k as
 the
   metric..
  
   Does it make sense to add the multiclass metric and prec@k, pdcg@k in
   examples.MovielensALS along with RMSE ?
  
   Thanks.
   Deb
 



Re: matrix factorization cross validation

2014-10-30 Thread Nick Pentreath
Sean, re my point earlier do you know a more efficient way to compute top k for 
each user, other than to broadcast the item factors? 


(I guess one can use the new asymmetric lsh paper perhaps to assist)


—
Sent from Mailbox

On Thu, Oct 30, 2014 at 11:24 PM, Sean Owen so...@cloudera.com wrote:

 MAP is effectively an average over all k from 1 to min(#
 recommendations, # items rated) Getting first recommendations right is
 more important than the last.
 On Thu, Oct 30, 2014 at 10:21 PM, Debasish Das debasish.da...@gmail.com 
 wrote:
 Does it make sense to have a user specific K or K is considered same over
 all users ?

 Intuitively the users who watches more movies should get a higher K than the
 others...


Re: Oryx + Spark mllib

2014-10-19 Thread Nick Pentreath
We've built a model server internally, based on Scalatra and Akka
Clustering. Our use case is more geared towards serving possibly thousands
of smaller models.

It's actually very basic, just reads models from S3 as strings (!!) (uses
HDFS FileSystem so can read from local, HDFS, S3) and uses Breeze for
linear algebra. (Technically it is also not dependent on Spark, it could be
reading models generated by any computation layer).
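
A rough sketch of that read path via the Hadoop FileSystem API (the helper name
is hypothetical, not part of the system described above):

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.io.Source

  // Resolves file://, hdfs:// and s3 URIs through the same API.
  def readModel(uri: String): String = {
    val fs = FileSystem.get(new URI(uri), new Configuration())
    val in = fs.open(new Path(uri))
    try Source.fromInputStream(in).mkString
    finally in.close()
  }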

It's designed to allow scaling via cluster sharding, by adding nodes (but
could also support a load-balanced approach). Not using persistent actors
as doing a model reload on node failure is not a disaster as we have
multiple levels of fallback.

Currently it is a bit specific to our setup (and only focused on
recommendation models for now), but could with some work be made generic.
I'm certainly considering if we can find the time to make it a releasable
project.

One major difference to Oryx is that it only handles the model loading and
vector computations, not the filtering-related and other things that come
as part of a recommender system (that is done elsewhere in our system). It
also does not handle the ingesting of data at all.

On Sun, Oct 19, 2014 at 7:10 AM, Sean Owen so...@cloudera.com wrote:

 Yes, that is exactly what the next 2.x version does. Still in progress but
 the recommender app and framework are code - complete. It is not even
 specific to MLlib and could plug in other model build functions.

 The current 1.x version will not use MLlib. Neither uses Play but is
 intended to scale just by adding web servers however you usually do.

 See graphflow too.
 On Oct 18, 2014 5:06 PM, Rajiv Abraham rajiv.abra...@gmail.com wrote:

  Oryx 2 seems to be geared for Spark
 
  https://github.com/OryxProject/oryx
 
  2014-10-18 11:46 GMT-04:00 Debasish Das debasish.da...@gmail.com:
 
   Hi,
  
   Is someone working on a project on integrating Oryx model serving layer
   with Spark ? Models will be built using either Streaming data / Batch
  data
   in HDFS and cross validated with mllib APIs but the model serving layer
   will give API endpoints like Oryx
   and read the models may be from hdfs/impala/SparkSQL
  
   One of the requirement is that the API layer should be scalable and
   elastic...as requests grow we should be able to add more nodes...using
  play
   and akka clustering module...
  
   If there is a ongoing project on github please point to it...
  
   Is there a plan of adding model serving and experimentation layer to
  mllib
   ?
  
   Thanks.
   Deb
  
 
 
 
  --
  Take care,
  Rajiv
 



Re: Oryx + Spark mllib

2014-10-19 Thread Nick Pentreath
Well, when I started development ~2 years ago, Scalatra just appealed more,
being more lightweight (I didn't need MVC just barebones REST endpoints),
and I still find its API / DSL much nicer to work with. Also, the swagger
API docs integration was important to me. So it's more familiarity than any
other reason.

If I were to build a model server from scratch perhaps Spray/Akka HTTP
would be the better way to go purely for integration purposes.

Having said that I think Scalatra is great and performant, so it's not a
no-brainer either way.

On Sun, Oct 19, 2014 at 5:29 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi Nick,

 Any specific reason of choosing scalatra and not play/spray (now that they
 are getting integrated) ?

 Sean,

 Would you be interested in a play and akka clustering based module in
 oryx2 and see how it compares against the servlets ? I am interested to
 understand the scalability

 Thanks.
 Deb

 On Sat, Oct 18, 2014 at 11:22 PM, Nick Pentreath nick.pentre...@gmail.com
  wrote:

 We've built a model server internally, based on Scalatra and Akka
 Clustering. Our use case is more geared towards serving possibly thousands
 of smaller models.

 It's actually very basic, just reads models from S3 as strings (!!) (uses
 HDFS FileSystem so can read from local, HDFS, S3) and uses Breeze for
 linear algebra. (Technically it is also not dependent on Spark, it could be
 reading models generated by any computation layer).

 It's designed to allow scaling via cluster sharding, by adding nodes (but
 could also support a load-balanced approach). Not using persistent actors
 as doing a model reload on node failure is not a disaster as we have
 multiple levels of fallback.

 Currently it is a bit specific to our setup (and only focused on
 recommendation models for now), but could with some work be made generic.
 I'm certainly considering if we can find the time to make it a releasable
 project.

 One major difference to Oryx is that it only handles the model loading
 and vector computations, not the filtering-related and other things that
 come as part of a recommender system (that is done elsewhere in our
 system). It also does not handle the ingesting of data at all.

 On Sun, Oct 19, 2014 at 7:10 AM, Sean Owen so...@cloudera.com wrote:

 Yes, that is exactly what the next 2.x version does. Still in progress
 but
 the recommender app and framework are code - complete. It is not even
 specific to MLlib and could plug in other model build functions.

 The current 1.x version will not use MLlib. Neither uses Play but is
 intended to scale just by adding web servers however you usually do.

 See graphflow too.
 On Oct 18, 2014 5:06 PM, Rajiv Abraham rajiv.abra...@gmail.com
 wrote:

  Oryx 2 seems to be geared for Spark
 
  https://github.com/OryxProject/oryx
 
  2014-10-18 11:46 GMT-04:00 Debasish Das debasish.da...@gmail.com:
 
   Hi,
  
   Is someone working on a project on integrating Oryx model serving
 layer
   with Spark ? Models will be built using either Streaming data / Batch
  data
   in HDFS and cross validated with mllib APIs but the model serving
 layer
   will give API endpoints like Oryx
   and read the models may be from hdfs/impala/SparkSQL
  
   One of the requirement is that the API layer should be scalable and
   elastic...as requests grow we should be able to add more
 nodes...using
  play
   and akka clustering module...
  
   If there is a ongoing project on github please point to it...
  
   Is there a plan of adding model serving and experimentation layer to
  mllib
   ?
  
   Thanks.
   Deb
  
 
 
 
  --
  Take care,
  Rajiv
 






Re: Oryx + Spark mllib

2014-10-19 Thread Nick Pentreath
The shared-nothing load-balanced server architecture works for all but the
most massive models - and even then a few big EC2 r3 instances should do
the trick.

One nice thing about Akka (and especially the new HTTP) is fault tolerance,
recovery and potential for persistence.

For us arguably the sharding is somewhat overkill initially, but does allow
easy scaling in future where conceivably all models may not fit into single
machine memory.

On Sun, Oct 19, 2014 at 5:46 PM, Sean Owen so...@cloudera.com wrote:

 Briefly, re: Oryx2, since the intent is for users to write their own
 serving apps, I though JAX-RS would be more familiar to more
 developers. I don't know how hard/easy REST APIs are in JAX-RS vs
 anything else but I suspect it's not much different.

 The interesting design decision that impacts scale is: do you
 distribute scoring of each request across a cluster? the servlet-based
 design does not and does everything in-core, in-memory.

 Pros: Dead simple architecture. Hard to beat for low latency. Anything
 more complex is big overkill for most models (RDF, k-means) -- except
 recommenders.

 Cons: For recommenders, harder to scale since everything is in-memory.
 And that's a big but.

 On Sun, Oct 19, 2014 at 11:29 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Would you be interested in a play and akka clustering based module in
 oryx2
  and see how it compares against the servlets ? I am interested to
 understand
  the scalability



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread Nick Pentreath
Might be worth checking out scikit-learn and mahout to get some broad ideas.
—
Sent from Mailbox

On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling rnowl...@gmail.com wrote:

 I went ahead and created JIRAs.
 JIRA for Hierarchical Clustering:
 https://issues.apache.org/jira/browse/SPARK-2429
 JIRA for Standarized Clustering APIs:
 https://issues.apache.org/jira/browse/SPARK-2430
 Before submitting a PR for the standardized API, I want to implement a
 few clustering algorithms for myself to get a good feel for how to
 structure such an API.  If others with more experience want to dive
 into designing the API, though, that would allow us to get moving more
 quickly.
 On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath nick.pentre...@gmail.com 
 wrote:
 Cool seems like a good initiative. Adding a couple of extra high quality 
 clustering implementations will be great.


 I'd say it would make most sense to submit a PR for the Standardised API 
 first, agree that with everyone and then build on it for the specific 
 implementations.
 —
 Sent from Mailbox

 On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks everyone for the input.
 So it seems what people want is:
 * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
 conquer approach, look at DecisionTree implementation as a reference)
 * Restructure 3 Kmeans clustering algorithm implementations to prevent
 code duplication and conform to a consistent API where possible
 If this is correct, I'll start work on that.  How would it be best to
 structure it? Should I submit separate JIRAs / PRs for refactoring of
 current code, MiniBatch KMeans, and Hierarchical or keep my current
 JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
 for Hierarchical KMeans that builds on top of that?
 Thanks!
 On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote:
 Yeah if one were to replace the objective function in decision tree with
 minimizing the variance of the leaf nodes it would be a hierarchical
 clusterer.


 On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


 On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

  No was thinking more top-down:
 
  assuming a distributed kmeans system already existing, recursively apply
  the kmeans algorithm on data already partitioned by the previous level 
  of
  kmeans.
 
  I haven't been much of a fan of bottom up approaches like HAC mainly
  because they assume there is already a distance metric for items to
 items.
  This makes it hard to cluster new content. The distances between sibling
  clusters is also hard to compute (if you have thrown away the similarity
  matrix), do you count paths to same parent node if you are computing
  distances between items in two adjacent nodes for example. It is also a
 bit
  harder to distribute the computation for bottom up approaches as you 
  have
  to already find the nearest neighbor to an item to begin the process.
 
 
  On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:
 
   The scikit-learn implementation may be of interest:
  
  
  
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
  
   It's a bottom up approach.  The pair of clusters for merging are
   chosen to minimize variance.
  
   Their code is under a BSD license so it can be used as a template.
  
   Is something like that you were thinking Hector?
  
   On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
sure. more interesting problem here is choosing k at each level.
 Kernel
methods seem to be most promising.
   
   
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
  wrote:
   
No idea, never looked it up. Always just implemented it as doing
  k-means
again on each cluster.
   
FWIW standard k-means with euclidean distance has problems too with
  some
dimensionality reduction methods. Swapping out the distance metric
  with
negative dot or cosine may help.
   
Other more useful clustering would be hierarchical SVD. The reason
  why I
like hierarchical clustering is it makes for faster inference
  especially
over billions of users.
   
   
On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 
wrote:
   
 Hector, could you share

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread Nick Pentreath
Cool seems like a good initiative. Adding a couple of extra high quality clustering 
implementations will be great.


I'd say it would make most sense to submit a PR for the Standardised API first, 
agree that with everyone and then build on it for the specific implementations.
—
Sent from Mailbox

On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote:

 Thanks everyone for the input.
 So it seems what people want is:
 * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and
 conquer approach, look at DecisionTree implementation as a reference)
 * Restructure 3 Kmeans clustering algorithm implementations to prevent
 code duplication and conform to a consistent API where possible
 If this is correct, I'll start work on that.  How would it be best to
 structure it? Should I submit separate JIRAs / PRs for refactoring of
 current code, MiniBatch KMeans, and Hierarchical or keep my current
 JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR
 for Hierarchical KMeans that builds on top of that?
 Thanks!
 On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote:
 Yeah if one were to replace the objective function in decision tree with
 minimizing the variance of the leaf nodes it would be a hierarchical
 clusterer.


 On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 If you're thinking along these lines, have a look at the DecisionTree
 implementation in MLlib. It uses the same idea and is optimized to prevent
 multiple passes over the data by computing several splits at each level of
 tree building. The tradeoff is increased model state and computation per
 pass over the data, but fewer total passes and hopefully lower
 communication overheads than, say, shuffling data around that belongs to
 one cluster or another. Something like that could work here as well.

 I'm not super-familiar with hierarchical K-Means so perhaps there's a more
 efficient way to implement it, though.


 On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

  No was thinking more top-down:
 
  assuming a distributed kmeans system already existing, recursively apply
  the kmeans algorithm on data already partitioned by the previous level of
  kmeans.
 
  I haven't been much of a fan of bottom up approaches like HAC mainly
  because they assume there is already a distance metric for items to
 items.
  This makes it hard to cluster new content. The distances between sibling
  clusters is also hard to compute (if you have thrown away the similarity
  matrix), do you count paths to same parent node if you are computing
  distances between items in two adjacent nodes for example. It is also a
 bit
  harder to distribute the computation for bottom up approaches as you have
  to already find the nearest neighbor to an item to begin the process.
 
 
  On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:
 
   The scikit-learn implementation may be of interest:
  
  
  
 
 http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
  
   It's a bottom up approach.  The pair of clusters for merging are
   chosen to minimize variance.
  
   Their code is under a BSD license so it can be used as a template.
  
   Is something like that you were thinking Hector?
  
   On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
sure. more interesting problem here is choosing k at each level.
 Kernel
methods seem to be most promising.
   
   
On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com
  wrote:
   
No idea, never looked it up. Always just implemented it as doing
  k-means
again on each cluster.
   
FWIW standard k-means with euclidean distance has problems too with
  some
dimensionality reduction methods. Swapping out the distance metric
  with
negative dot or cosine may help.
   
Other more useful clustering would be hierarchical SVD. The reason
  why I
like hierarchical clustering is it makes for faster inference
  especially
over billions of users.
   
   
On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com
 
wrote:
   
 Hector, could you share the references for hierarchical K-means?
   thanks.


 On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com
   wrote:

  I would say for bigdata applications the most useful would be
 hierarchical
  k-means with back tracking and the ability to support k nearest
 centroids.
 
 
  On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling rnowl...@gmail.com
 
wrote:
 
   Hi all,
  
   MLlib currently has one clustering algorithm implementation,
   KMeans.
   It would benefit from having implementations of other
 clustering
   algorithms such as MiniBatch KMeans, Fuzzy C-Means,
 Hierarchical
   Clustering, and Affinity Propagation.
  
   I recently submitted a PR [1] for a MiniBatch 

  1   2   >