Re: Revisiting Online serving of Spark models?

2018-06-03 Thread Holden Karau
>> >> Ground floor (outside of conference area - should be available for all) - >> we will meet and decide where to go >> >> (Would not send invite because that would be too much noise for dev@) >> >> To paraphrase Joseph, we will use this

Re: [build system] meet your build engineer @ spark ai summit SF 2018

2018-06-05 Thread Holden Karau
That's awesome! On Tue, Jun 5, 2018 at 12:23 PM, shane knapp wrote: > just a reminder to come meet your build engineer! > > we'll also be having a couple of demos of current projects in the lab: > tuesday (today) 130pm -- pandas on ray (https://rise.cs.berkeley.edu/ > blog/pandas-on-ray/) > wedn

Review notification bot

2018-06-06 Thread Holden Karau
Hi friends, Was chatting with some folks at the summit and I was wondering how people would feel about adding a review bot to ping folks. We already have the review dashboard but I was thinking we could ping folks who were the original authors of the code being changed who might not be in the hab

Re: Scala 2.12 support

2018-06-06 Thread Holden Karau
Just chatted with Dean @ the summit and it sounds like from Adriaan there is a fix in 2.13 for the API change issue that could be backported to 2.12 so how about we try and get this ball rolling? It sounds like it would also need a closure cleaner change, which could be backwards compatible but s

Re: Review notification bot

2018-06-06 Thread Holden Karau
> 100K-line PRs. So maybe some way to decline or silence is important, or > maybe just ping once and leave it. Sure, a bot that just adds a "Would @foo > like to review?" comment on Github? Sure seems worth trying if someone is > willing to do the work to cook up the bot

Re: Revisiting Online serving of Spark models?

2018-06-06 Thread Holden Karau
h < > nick.pentre...@gmail.com> wrote: > >> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it. >> >> On Sun, 3 Jun 2018 at 00:24 Holden Karau wrote: >> >>> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice < >>> maximilianofel

Re: Scala 2.12 support

2018-06-07 Thread Holden Karau
6:43 AM, Felix Cheung >> wrote: >> > >> > +1 >> > >> > Spoke to Dean as well and mentioned the problem with 2.11.12 >> https://github.com/scala/bug/issues/10913 >> > >> > _ >> > From: Sean Owen

Re: Scala 2.12 support

2018-06-07 Thread Holden Karau
pplications.csp>, > and other content from O'Reilly > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com > https://github.com/deanwampler > > On Thu, Jun 7, 2018 at 5:09 PM, Holden Karau wrote: > >> If the difference is the order of the

Re: Scala 2.12 support

2018-06-07 Thread Holden Karau
ion. > > scala> Spark context Web UI available at http://192.168.1.169:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1528180279528). > Spark session available as 'spark’. > scala> > > DB Tsai | Siri Open Source Technologies [no

Re: Live Streamed Code Review today at 11am Pacific

2018-06-07 Thread Holden Karau
I'll be doing another one tomorrow morning at 9am pacific focused on Python + K8s support & improved JSON support - https://www.youtube.com/watch?v=Z7ZEkvNwneU & https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :) On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau wrote: > If anyon

Re: Revisiting Online serving of Spark models?

2018-06-11 Thread Holden Karau
So I kicked off a thread on user@ to collect people's feedback there but I'll summarize the offline results later this week too. On Tue, Jun 12, 2018, 5:03 AM Liang-Chi Hsieh wrote: > > Hi, > > It'd be great if there can be any sharing of the offline discussion. >

Re: Live Streamed Code Review today at 11am Pacific

2018-06-14 Thread Holden Karau
d the other will be the regular Friday code review ( https://www.youtube.com/watch?v=IAWm4OLRoyY / https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am. On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau wrote: > I'll be doing another one tomorrow morning at 9am pacific focused on &

Re: Live Streamed Code Review today at 11am Pacific

2018-06-27 Thread Holden Karau
.com/user/holdenkarau & https://www.twitch.tv/holdenkarau/events . Hopefully this can encourage more folks to help with RC validation & PR reviews :) On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau wrote: > Next week is pride in San Francisco but I'm still going to do two quick > sess

[ANNOUNCE] Apache Spark 2.1.3

2018-07-01 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.3! Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. The release notes are available at http://spark.apache.org/releases/s

Re: [VOTE] Spark 2.2.2 (RC2)

2018-07-01 Thread Holden Karau
Leaving documents aside (I think we should maybe have a thread on how we want to handle doc changes to existing releases on dev@) I'm +1 PySpark venv checks out. On Sun, Jul 1, 2018 at 9:40 PM, Hyukjin Kwon wrote: > Let me leave a note about https://issues.apache. > org/jira/browse/SPARK-24530.

Re: Beam's recent community development work

2018-07-02 Thread Holden Karau
As someone who floats a bit between both projects (as a contributor) I'd love to see us adopt some of these techniques to be pro-active about growing our committer-ship (I think perhaps we could do this by also moving some of the newer committers into the PMC faster so there are more eyes out looki

Re: Live Streamed Code Review today at 11am Pacific

2018-07-13 Thread Holden Karau
ySpark and working on Sparkling ML - https://www.youtube.com/watch?v=kCnBDpNce9A&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw&index=32 On Wed, Jun 27, 2018 at 10:44 AM, Holden Karau wrote: > Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and > see how we validate S

Re: Review notification bot

2018-07-14 Thread Holden Karau
is on for Spark on a trial basis and we can revisit it based on how folks interact with it. On Wed, Jun 6, 2018 at 12:24 PM, Holden Karau wrote: > So there are a few bots along this line in OSS. If no one objects I’ll > take a look and find one which matches our use case and try it out.

Re: Pyspark access to scala/java libraries

2018-07-15 Thread Holden Karau
If you want to see some examples, one library that shows a way to do it is https://github.com/sparklingpandas/sparklingml and High Performance Spark also talks about it. On Sun, Jul 15, 2018, 11:57 AM <0xf0f...@protonmail.com.invalid> wrote: > Check > https://stackoverflow.com/questions/31684842/callin

Re: Live Streamed Code Review today at 11am Pacific

2018-07-19 Thread Holden Karau
Heads up: tomorrow's Friday review is going to be at 8:30 am instead of 9:30 am because I had to move some flights around. On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau wrote: > This afternoon @ 3pm pacific I'll be looking at review tooling for Spark & > Beam https://www.yout

Re: Review notification bot

2018-07-22 Thread Holden Karau
tion-bot (which Facebook is apparently no > longer maintaining <https://github.com/facebookarchive/mention-bot#readme>), > or if we can even use it given the ASF setup on GitHub. But I thought it > would be worth mentioning nonetheless. > > On Sat, Jul 14, 2018 at 11:17 AM Hol

Live Code Reviews, Coding, and Dev Tools

2018-07-24 Thread Holden Karau
Tomorrow afternoon @ 3pm pacific I'll be doing some dev tools poking for Beam and Spark - https://www.youtube.com/watch?v=6cTmC_fP9B0 for mention-bot. On Friday I'll be doing my normal code reviews - https://www.youtube.com/watch?v=O4rRx-3PTiM On Monday July 30th @ 9:30am I'll be doing some more

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Holden Karau
I’m excited to have more folks rotate through release manager :) On Sun, Jul 29, 2018 at 3:57 PM Stavros Kontopoulos < stavros.kontopou...@lightbend.com> wrote: > +1. That would great! > > Thanks, > Stavros > > On Sun, Jul 29, 2018 at 5:05 PM, Wenchen Fan wrote: > >> If no one objects, how about

Re: Review notification bot

2018-07-30 Thread Holden Karau
I can see the configurations for it? > > > On Mon, Jul 23, 2018 at 10:16 AM, Holden Karau wrote: > >> Yeah so the issue with codeowners is it will only assign to committers on >> the repo (the Beam project found this out the practical application way). >> >> I have a f

Re: Review notification bot

2018-07-30 Thread Holden Karau
> but I couldn't find (sorry if it's just something I simply missed). > > On Tue, Jul 31, 2018 at 1:48 AM, Holden Karau wrote: > >> So the one that is running is the fork in my own repo (set up for K8s >> deployment) - http://github.com/holdenk/mention-bot >> >>

Re: Review notification bot

2018-07-30 Thread Holden Karau
y this. >> Also, some people could be interested in few specific areas. They should >> get pinged too. >> Also, assuming from people pinged, seems they are reviewers (which >> basically means committers I guess). Was wondering if there's a big >> difference

Re: Review notification bot

2018-07-30 Thread Holden Karau
> super happy with that pinging for now. I was slightly supportive for this >> idea but now I actually slightly >> became negative on this after observing how it goes in practice. >> >> I wonder how other people think on this. >> >> >> >> Jul 31, 2018

Re: Review notification bot

2018-07-30 Thread Holden Karau
Another thing we could try and do (if folks would be down to try) is to have it not actually ping, but suggest the potential usernames to ping to the user (e.g. say suggested reviewers you _may wish to ping_ and then list)? On Mon, Jul 30, 2018 at 10:45 PM, Holden Karau wrote: > > On Mon,

Re: Review notification bot

2018-07-30 Thread Holden Karau
> at least, (almost?) all of them are committers and something needs to be >>>> fixed even if so. >>>> >>>> I recently argued about pinging things before - sounds it matters if it >>>> annoys. Since pinging is completely optional and cc'ing someone else

Re: Review notification bot

2018-07-31 Thread Holden Karau
add a rate limit) > 5. Non-committers look not pinged given my observation > 6. It is completely optional and it's rather something committer should > regularly > - this could imply we don't have enough active committers. > > > On Tue, Jul 31, 2018 at 2:12 PM, Hold

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Holden Karau
I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon), but solving some of the consistent Python memory issues we've had for years would be really amazing to get in. On Tue, Aug 7, 2018 at 1:07 PM, Tom Graves wrote: > I would like to get clarification on our avro compatibilit

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Holden Karau
Seems reasonable. We should probably add `getActiveSession` to the PySpark API (filed a starter JIRA https://issues.apache.org/jira/browse/SPARK-25255 ) On Mon, Aug 27, 2018 at 12:09 PM Andrew Melo wrote: > Hello Sean, others - > > Just to confirm, is it OK for client applications to access > Sp
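
A minimal sketch of the "get without create" idea from this thread, in plain Python. The `Session` class and its `get_or_create` / `get_active` / `stop` methods are hypothetical stand-ins and not Spark's implementation; the only point is that the lookup returns `None` rather than constructing a session as a side effect.

```python
import threading
from typing import Optional


class Session:
    """Hypothetical stand-in for a session object; not Spark's class."""

    _lock = threading.Lock()
    _active: Optional["Session"] = None

    def __init__(self, app_name: str) -> None:
        self.app_name = app_name

    @classmethod
    def get_or_create(cls, app_name: str = "default") -> "Session":
        # Creates a session if none is active, otherwise reuses the active one.
        with cls._lock:
            if cls._active is None:
                cls._active = cls(app_name)
            return cls._active

    @classmethod
    def get_active(cls) -> Optional["Session"]:
        # Pure lookup: never creates a session as a side effect.
        with cls._lock:
            return cls._active

    def stop(self) -> None:
        # Clears the active reference only if it still points at this session.
        with type(self)._lock:
            if type(self)._active is self:
                type(self)._active = None
```

A client library can then call `Session.get_active()` and decide for itself what to do when the result is `None`, instead of silently starting a session the user did not ask for.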

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Holden Karau
PM Andrew Melo wrote: > Hi Holden, > > I'm agnostic to the approach (though it seems cleaner to have an > explicit API for it). If you would like, I can take that JIRA and > implement it (should be a 3-line function). > > Cheers > Andrew > > On Mon, Aug 27, 2018

Re: Branch 2.4 is cut

2018-09-07 Thread Holden Karau
Was doing my weekly code review and went to close an issue, but since it wasn't one of the categories listed wasn't going to merge into the 2.4 branch but we need a new version in JIRA for us to close issues to that are going to merge into master but no

Python friendly API for Spark 3.0

2018-09-14 Thread Holden Karau
Since we're talking about Spark 3.0 in the near future (and since some recent conversation on a proposed change reminded me) I wanted to open up the floor and see if folks have any ideas on how we could make a more Python friendly API for 3.0? I'm planning on taking some time to look at other syste

Re: Python friendly API for Spark 3.0

2018-09-14 Thread Holden Karau
aching out to user@ before making that kind of change. > > On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau > wrote: > >> Since we're talking about Spark 3.0 in the near future (and since some >> recent conversation on a proposed change reminded me) I wanted to open up >>

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Holden Karau
Deprecating Py 2 in the 2.4 release probably doesn't belong in the RC vote thread. Personally I think we might be a little too late in the game to deprecate it in 2.4, but I think calling it out as "soon to be deprecated" in the release docs would be sensible to give folks extra time to prepare. O

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Holden Karau
So normally during the release process if it's in branch-2.4 but not part of the current RC we set the resolved version to 2.4.1 and then if we roll a new RC we switch the 2.4.1 issues to 2.4.0. On Thu, Sep 20, 2018 at 9:55 PM Jungtaek Lim wrote: > I also noticed there're some fixed issues which ar

Re: Live Streamed Code Review today at 11am Pacific

2018-09-20 Thread Holden Karau
order batches) is my current plan to start with :) On Thu, Jul 19, 2018 at 11:38 PM Holden Karau wrote: > Heads up tomorrows Friday review is going to be at 8:30 am instead of 9:30 > am because I had to move some flights around. > > On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau &

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Holden Karau
Oh that does look like an important correctness issue. -1 On Mon, Oct 1, 2018, 9:57 AM Marco Gaido wrote: > -1, I was able to reproduce SPARK-25538 with the provided data. > > On Mon, Oct 1, 2018 at 09:11 Ted Yu wrote: > >> +1 >> >> Original message >> From:

Code review and Coding livestreams today

2018-10-12 Thread Holden Karau
I’ll be doing my regular weekly code review at 10am Pacific today - https://youtu.be/IlH-EGiWXK8 with a look at the current RC, and in the afternoon at 3pm Pacific I’ll be doing some live coding around WIP graceful decommissioning PR - https://youtu.be/4FKuYk2sbQ8 -- Twitter: https://twitter.com/h

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-12 Thread Holden Karau
Following up I just wanted to make sure this new blocker that Dongjoon designated is surfaced - https://jira.apache.org/jira/browse/SPARK-25579?filter=12340409&jql=affectedVersion%20%3D%202.4.0%20AND%20cf%5B12310320%5D%20is%20EMPTY%20AND%20project%20%3D%20spark%20AND%20(status%20%3D%20%22In%20Progr

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-13 Thread Holden Karau
So if it's a blocker would you think this should be a -1? On Fri, Oct 12, 2018 at 3:52 PM Dongjoon Hyun wrote: > Hi, Holden. > > Since that's a performance at 2.4.0, I marked as `Blocker` four days ago. > > Bests, > Dongjoon. > > > On Fri, Oct 12, 2

Helper methods for PySpark discussion

2018-10-26 Thread Holden Karau
Coming out of https://github.com/apache/spark/pull/21654 it was agreed the helper methods in question made sense but there was some desire for a plan as to which helper methods we should use. I'd like to propose a lightweight solution to start with for helper methods that match either Pandas or g

Re: Helper methods for PySpark discussion

2018-10-26 Thread Holden Karau
g on and suggests the explicit operation that would do the most >> equivalent thing. And perhaps raise a warning (using the warnings module) >> for things that might be unintuitively expensive. >> On Fri, Oct 26, 2018 at 12:15 Holden Karau wrote: >> >>> Coming out of https:
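
A small sketch of the "raise a warning (using the warnings module)" suggestion quoted above. The helper name `head_sorted` and the `ExpensiveOperationWarning` category are hypothetical, chosen only for illustration; they are not part of PySpark.

```python
import warnings
from typing import List


class ExpensiveOperationWarning(UserWarning):
    """Hypothetical warning category for helpers that may be costly."""


def head_sorted(rows: List[int], n: int = 5) -> List[int]:
    """Hypothetical Pandas-style helper that hides a full sort."""
    warnings.warn(
        "head_sorted() sorts the whole dataset; call sorted(rows)[:n] "
        "explicitly if that cost is intended.",
        ExpensiveOperationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return sorted(rows)[:n]
```

Using a dedicated warning category lets users silence or escalate these messages with the standard `warnings` filters without affecting other warnings.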

Re: Trigger full GC during executor idle time?

2018-12-31 Thread Holden Karau
Maybe it would make sense to loop in the paper authors? I imagine they might have more information than ended up in the paper. On Mon, Dec 31, 2018 at 2:10 PM Ryan Blue wrote: > After a quick look, I don't think that the paper's >

Austin Area Contributors and Reviewers

2019-01-19 Thread Holden Karau
aturally I care most about the Spark PR backlog). If you want to learn how to improve your OSS review skills that's awesome too - https://www.eventbrite.com/e/level-up-your-skills-with-open-source-code-reviews-a-live-review-with-holden-karau-tickets-54206463993 :) Cheers, Holden :) -- Twi

Re: [DISCUSS] SPIP: .NET bindings for Apache Spark

2019-02-27 Thread Holden Karau
I’m +1 on Sean's comment on the JIRA: initially starting outside of Spark is probably the easiest way forward. On Wed, Feb 27, 2019 at 10:04 AM Sriram Sundaresan < sriram.sundare...@imaginea.com> wrote: > I am interested to take this up. Please let me know how to > proceed/contribute to this. >

Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Holden Karau
Looking at the Sink in 2.0 there is a warning (added in SPARK-16020 without a lot of details) that says "Note: You cannot apply any operators on `data` except consuming it (e.g., `collect/foreach`)." but I'm wondering if this restriction is perhaps too broadly worded? Provided that we consume the d

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Holden Karau
ed late > in the release process we decided it was better to document the current > behavior, rather than do a large refactoring. > > On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau > wrote: > >> Looking at the Sink in 2.0 there is a warning (added in SPARK-16020 >>

Re: [jira] [Resolved] (SPARK-16345) Extract graphx programming guide example snippets from source files instead of hard code them

2016-07-02 Thread Holden Karau
2.0.1 just means that the fix will be included in 2.0.1 (e.g. it's not in the current 2.0.0 RC). On Saturday, July 2, 2016, Jacek Laskowski wrote: > Hi Sean, devs, > > How is this possible that Fix Version/s is 2.0.1 given 2.0.0 was not > released yet? Why is that that master is not what's going to

[PySPARK] - Py4J binary transfer survey

2016-07-06 Thread Holden Karau
Hi PySpark Devs, The Py4j developer has a survey up for Py4J users - https://github.com/bartdag/py4j/issues/237 it might be worth our time to provide some input on how we are using and would like to be using Py4J if binary transfer was improved. I'm happy to fill it out with my thoughts - but if o

Re: Spark performance regression test suite

2016-07-08 Thread Holden Karau
There are also the spark-perf and spark-sql-perf projects in the Databricks github (although I see an open issue for Spark 2.0 support in one of them). On Friday, July 8, 2016, Ted Yu wrote: > Found a few issues: > > [SPARK-6810] Performance benchmarks for SparkR > [SPARK-2833] performance tests

Re: Spark Homepage

2016-07-13 Thread Holden Karau
This has also been reported on the user@ by a few people - other apache projects (arrow & hadoop) don't seem to be affected so maybe it was just a bad update for the Spark website? On Wed, Jul 13, 2016 at 12:05 PM, Dongjoon Hyun wrote: > Hi, All. > > Currently, Spark Homepage (http://spark.apach

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-19 Thread Holden Karau
-1 : The docs don't seem to be fully built (e.g. http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/streaming-programming-guide.html is a zero byte file currently) - although if this is a transient apache issue no worries. On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin wrote:

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-19 Thread Holden Karau
spark-2.0.0-rc4-docs-updated/ > > On Tue, Jul 19, 2016 at 3:19 PM Holden Karau wrote: > >> -1 : The docs don't seem to be fully built (e.g. >> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/streaming-programming-guide.html >> is a zero byte file

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Holden Karau
+1 (non-binding) Built locally on Ubuntu 14.04, basic pyspark sanity checking & tested with a simple structured streaming project (spark-structured-streaming-ml) & spark-testing-base & high-performance-spark-examples (minor changes required from preview version but seem intentional & jetty conflic

Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Holden Karau
Now that the 2.0 release is out the door and I've got some cycles to do some cleanups - I'd like to know what other people think of the internal deprecation warnings we've introduced in a lot of places in our code. Once before I did some minor refactoring so the Python code which had to use the

Re: Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Holden Karau
ts to still test the deprecated code but it > ought to be possible to make the non-test code avoid it entirely. > > On Wed, Jul 27, 2016 at 12:11 PM, Holden Karau > wrote: > > Now that the 2.0 release is out the door and I've got some cycles to do > some > > cleanup

Re: How do a new developer create or assign a jira ticket?

2016-07-27 Thread Holden Karau
Hi Neil, Thanks for your interest in participating in Apache Spark! You can create JIRAs - but first you will need to sign up for an Apache JIRA account. Generally we can't assign JIRAs to ourselves - but you can leave a comment saying you're interested in working on it. I think for R a good place to get s

Re: AccumulatorV2 += operator

2016-08-02 Thread Holden Karau
I believe it was intentional with the idea that it would be more unified between Java and Scala APIs. If you're talking about the javadoc mention in https://github.com/apache/spark/pull/14466/files - I believe the += is meant to refer to what the internal implementation of the add function can be for

Re: AccumulatorV2 += operator

2016-08-03 Thread Holden Karau
, August 3, 2016, Bryan Cutler wrote: > No, I was referring to the programming guide section on accumulators, it > says " Tasks running on a cluster can then add to it using the add method > or the += operator (in Scala and Python)." > > On Aug 2, 2016 2:52 PM, "

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Holden Karau
Spark does not currently support Apache Arrow - probably a good place to chat would be on the Arrow mailing list where they are making progress towards unified JVM & Python/R support which is sort of a precondition of a functioning Arrow interface between Spark and Python. On Fri, Aug 5, 2016 at 1

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Holden Karau
upport Arrow? I'd just like to know that all the pieces will come > together eventually. > > (In this forum, most of the discussion about Arrow is about PySpark and > Pandas, not Spark in general.) > > Best, > Jim > > On Aug 5, 2016 2:43 PM, "Holden Karau"

Early Draft Structured Streaming Machine Learning

2016-08-18 Thread Holden Karau
Hi Everyone (that cares about structured streaming and ML), Seth and I have been giving some thought to supporting structured streaming in machine learning - we've put together an early design doc (it's been in JIRA (SPARK-16424) for a while, but inca

Re: Persisting PySpark ML Pipelines that include custom Transformers

2016-08-19 Thread Holden Karau
I don't think we've given a lot of thought to model persistence for custom Python models yet - if the Python model is wrapping a JVM model, using the JavaMLWritable along with '_to_java' should work provided your Java model is already saveable. On the other hand - if your model isn't wrapping a Java

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Holden Karau
I'm seeing some test failures with Python 3 that could definitely be environmental (going to rebuild my virtual env and double check), I'm just wondering if other people are also running the Python tests on this release or if everyone is focused on the Scala tests? On Mon, Sep 26, 2016 at 11:48 AM

StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-09-26 Thread Holden Karau
Hi Spark Developers, After some discussion on SPARK-16407 (and on the PR ) we’ve decided to jump back to the developer list (SPARK-16407 itself comes

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Holden Karau
>> Cheers >> >> >> On Mon, Sep 26, 2016 at 11:59 AM, Holden Karau >> wrote: >> >>> I'm seeing some test failures with Python 3 that could definitely be >>> environmental (going to rebuild my virtual env and double check), I'm just >

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Holden Karau
Congratulations :D :) Yay! On Tue, Oct 4, 2016 at 11:14 AM, Suresh Thalamati < suresh.thalam...@gmail.com> wrote: > Congratulations, Xiao! > > > > > On Oct 3, 2016, at 10:46 PM, Reynold Xin wrote: > > > > Hi all, > > > > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark > com

PySpark UDF Performance Exploration w/Jython (Early/rough 2~3X improvement*) [SPARK-15369]

2016-10-05 Thread Holden Karau
Hi Python Spark Developers & Users, As Datasets/DataFrames are becoming the core building block of Spark, and as someone who cares about Python Spark performance, I've been looking more at PySpark UDF performance. I've got an early WIP/request for comments pull request open

Re: Spark Improvement Proposals

2016-10-07 Thread Holden Karau
First off, thanks Cody for taking the time to put together these proposals - I think it has kicked off some wonderful discussion. I think dismissing people's complaints with Spark as largely trolls does us a disservice, it’s important for us to recognize our own shortcomings - otherwise we are bli

Re: PSA: JIRA resolutions and meanings

2016-10-08 Thread Holden Karau
We could certainly do that system - but given the current somewhat small set of active committers it's clearly not scaling very well. There are many developers in Spark like Hyukjin, Cody, and myself who care about specific areas and can verify if an issue is still present in mainline. That being

Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-10 Thread Holden Karau
I think it is really important to ensure that someone with a good understanding of Kafka is empowered around this component with a formal voice around - but I don't have much dev experience with our Kafka connectors so I can't speak to the specifics around it personally. More generally, I also fee

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-13 Thread Holden Karau
This is a thing I often have people ask me about, and then I do my best to dissuade them from using Spark in the "hot path" and it's normally something which most people eventually accept. Fred might have more information for people for whom this is a hard requirement though. On Thursday, October 13,

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-13 Thread Holden Karau
Awesome, good points everyone. The ranking of the issues is super useful and I'd also completely forgotten about the lack of built in UDAF support which is rather important. There is a PR to make it easier to call/register JVM UDFs from Python which will hopefully help a bit there too. I'm getting

Re: Contributing to PySpark

2016-10-18 Thread Holden Karau
Hi Krishna, Thanks for your interest in contributing to PySpark! I don't personally use either of those IDEs so I'll leave that part for someone else to answer - but in general you can find the building spark documentation at http://spark.apache.org/docs/latest/building-spark.html which includes note

Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-18 Thread Holden Karau
Right now the wiki isn't particularly accessible to updates by external contributors. We've already got a contributing to spark page which just links to the wiki - how about if we just move the wiki contents over? This way contributors can contribute to our documentation about how to contribute pro

Re: On convenience methods

2016-10-18 Thread Holden Karau
I think what Reynold means is that if it's easy for a developer to build this convenience function using the current Spark API it probably doesn't need to go into Spark unless it's being done to provide a similar API to a system we are attempting to be semi-compatible with (e.g. if a corresponding co

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Holden Karau
I'd also like to add Python 2.6 to the list of things. We've considered dropping it before but never followed through to the best of my knowledge (although on mobile right now so can't double check). On Tuesday, October 25, 2016, Sean Owen wrote: > I'd like to gauge where people stand on the iss

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-10-31 Thread Holden Karau
I believe Bryan is also working on this a little - and I'm a little busy with the other stuff but would love to stay in the loop on Arrow progress :) On Monday, October 31, 2016, mariusvniekerk wrote: > So i've been working on some very very early stage apache arrow > integration. > My current p

Re: Python Spark Improvements (forked from Spark Improvement Proposals)

2016-11-01 Thread Holden Karau
On that note there is some discussion on the Jira - https://issues.apache.org/jira/browse/SPARK-13534 :) On Mon, Oct 31, 2016 at 8:32 PM, Holden Karau wrote: > I believe Bryan is also working on this a little - and I'm a little busy > with the other stuff but would love to stay in

Blocked PySpark changes

2016-11-02 Thread Holden Karau
Hi Spark Developers & Maintainers, I know we've been talking a lot about what changes we want in PySpark to help keep it interesting and usable (see http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html). On

Re: Using mention-bot to automatically ping potential reviewers

2016-11-06 Thread Holden Karau
So according to the documentation it mostly uses blame lines which _might_ not be the best fit for Spark (since many of the people in the blame lines aren't going to have permission to commit the code). (Although it's possible that the algorithm that is actually used does more than the one described i
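
A rough sketch of the blame-line idea and the permission concern raised above. This is not mention-bot's actual algorithm; `suggest_reviewers` and its parameters are hypothetical, and the blame data is assumed to have been collected already (one author name per changed line).

```python
from collections import Counter
from typing import Iterable, List, Set


def suggest_reviewers(
    blame_authors: Iterable[str],
    pr_author: str,
    committers: Set[str],
    limit: int = 3,
) -> List[str]:
    """Rank authors by how many of the touched lines they last modified."""
    # Never suggest the person who opened the pull request.
    counts = Counter(a for a in blame_authors if a != pr_author)
    # Keep only people who can actually merge the change.
    eligible = [(a, c) for a, c in counts.items() if a in committers]
    # Sort by line count (descending), then by name for a stable order.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return [a for a, _ in eligible[:limit]]
```

Dropping the `committers` filter would reproduce the plain blame-based behaviour, where the top suggestions may be people without commit access.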

Re: Handling questions in the mailing lists

2016-11-10 Thread Holden Karau
That's a good question, looking at http://stackoverflow.com/tags/apache-spark/topusers shows a few contributors who have already been active on SO including some committers and PMC members with very high overall SO reputations for any administrative needs (as well as a number of other contributors

Re: issues with github pull request notification emails missing

2016-11-16 Thread Holden Karau
+1 it seems like I'm missing a number of my GitHub email notifications lately (although since I run my own mail server and forward I've been assuming it's my own fault). I've also had issues with having greatly delayed notifications on some of my own pull requests but that might be unrelated. On

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Holden Karau
I've been working on a blog post around this and hope to have it published early next month 😀 On Nov 17, 2016 10:16 PM, "Joseph Bradley" wrote: Hi Georg, It's true we need better documentation for this. I'd recommend checking out simple algorithms within Spark for examples: ml.feature.Tokenize

Re: Spark Wiki now migrated to spark.apache.org

2016-11-23 Thread Holden Karau
That's awesome thanks for doing the migration :) On Wed, Nov 23, 2016 at 3:29 AM Sean Owen wrote: > I completed the migration. You can see the results live right now at > http://spark.apache.org, and > https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage > > A summary of the changes:

Re: Can I add a new method to RDD class?

2016-12-05 Thread Holden Karau
Doing that requires publishing a custom version of Spark, you can edit the version number to do a publishLocal - but maintaining that change is going to be difficult. The other approaches suggested are probably better, but also does your method need to be defined on the RDD class? Could you instead

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Holden Karau
Thanks for the specific mention of the new PySpark packaging Shivaram, For *nix (Linux, Unix, OS X, etc.) Python users interested in helping test the new artifacts you can do as follows: Setup PySpark with pip by: 1. Download the artifact from http://home.apache.org/~pwendell/spark-releases/spar

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Holden Karau
+1 (non-binding) - checked Python artifacts with virtual env. On Sun, Dec 18, 2016 at 11:42 AM Denny Lee wrote: > +1 (non-binding) > > > On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin wrote: > > +1 > > Cheers, > Liwei > > > > On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang wrote: > > I hope https://

Re: A note about MLlib's StandardScaler

2017-01-08 Thread Holden Karau
Hi Gilad, Spark uses the sample standard deviation inside of the StandardScaler (see https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler ) which I think would explain the results you are seeing. I believe the scalers are intended to
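
To illustrate the point in the reply above, here is a minimal sketch in plain Python (not Spark code) of how scaling by the sample standard deviation (n-1 denominator) differs from scaling by the population standard deviation (n denominator). The helper name `scale_to_unit_std` and the example column are hypothetical and chosen only for illustration.

```python
import statistics


def scale_to_unit_std(values, sample=True):
    """Divide each value by the column's standard deviation.

    sample=True uses the corrected sample estimate (n-1 denominator);
    sample=False uses the population estimate (n denominator).
    This is an illustrative helper, not part of any Spark API.
    """
    std = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [v / std for v in values]


# A small hypothetical column of feature values.
column = [1.0, 2.0, 3.0, 4.0]

sample_scaled = scale_to_unit_std(column, sample=True)
population_scaled = scale_to_unit_std(column, sample=False)

print("scaled by sample std:    ", sample_scaled)
print("scaled by population std:", population_scaled)
```

Because the sample standard deviation is slightly larger than the population one, values scaled with it come out slightly smaller, which can account for small mismatches against tools that default to the population formula.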

Re: handling of empty partitions

2017-01-08 Thread Holden Karau
Hi Georg, Thanks for the question along with the code (as well as posting to Stack Overflow). In general if a question is well suited for Stack Overflow it's probably better suited to the user@ list instead of the dev@ list so I've cc'd the user@ list for you. As far as handling empty partitions wh

Re: scala.MatchError: scala.collection.immutable.Range.Inclusive from catalyst.ScalaReflection.serializerFor?

2017-01-09 Thread Holden Karau
If you want to check if it's your modifications or just in mainline, you can always just checkout mainline or stash your current changes to rebuild (this is something I do pretty often when I run into bugs I don't think I would have introduced). On Mon, Jan 9, 2017 at 1:01 AM Liang-Chi Hsieh wrot

Re: [PYSPARK] Python tests organization

2017-01-12 Thread Holden Karau
I'd be happy to help with reviewing Python test improvements. Maybe make an umbrella JIRA and do one subcomponent at a time? On Thu, Jan 12, 2017 at 12:20 PM Saikat Kanjilal wrote: > Following up, any thoughts on next steps for this? > --

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Holden Karau
Also thanks everyone :) Looking forward to helping out (and if anyone wants to get started contributing to PySpark please ping me :)) On Tue, Jan 24, 2017 at 3:24 PM, Burak Yavuz wrote: > Thank you very much everyone! Hoping to help out the community as much as > I can! > > Best, > Burak > > On

Re: Google Summer of Code 2017 is coming

2017-02-03 Thread Holden Karau
As someone who did GSoC back in university I think this could be a good idea if there is enough interest from the PMC & I'd be willing to help mentor if that is a bottleneck. On Fri, Feb 3, 2017 at 12:42 PM, Jacek Laskowski wrote: > Hi, > > Is this something Spark considering? Would be nice to

Re: Is there any plan to have a predict method for single instance on PipelineModel?

2017-02-05 Thread Holden Karau
I'm on mobile right now but there is a JIRA to add it to the models first and on that JIRA people are discussing single element transform as a possibility - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10413 There might be others as well that just aren't as fresh in my memory.

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Holden Karau
Congratulations Takuya-san :D! On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin wrote: > Hi all, > > Takuya-san has recently been elected an Apache Spark committer. He's been > active in the SQL area and writes very small, surgical patches that are > high quality. Please join me in congratulating T

[PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-13 Thread Holden Karau
Hi PySpark Developers, Cloudpickle is a core part of PySpark, and was originally copied from (and improved from) picloud. Since then other projects have found cloudpickle useful and a fork of cloudpickle was created and is now maintained as its own library
