Re: Python friendly API for Spark 3.0

2018-09-14 Thread Nicholas Chammas
Do we need to ditch Python 2 support to provide type hints? I don’t think so. Python lets you specify typing stubs that provide the same benefit without forcing Python 3. On Fri, Sep 14, 2018 at 8:01 PM, Holden Karau wrote: > > > On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson wrote: > >> To be clear,
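As a sketch of the stub approach (the module and function names below are made up): a .pyi file shipped next to the Python 2 source carries the annotations, so type checkers such as mypy see them while the runtime code stays Python 2 compatible.

    # wordcount.pyi -- hypothetical stub file shipped alongside wordcount.py
    # The stub uses Python 3 annotation syntax; the implementation module
    # it describes can remain plain Python 2 code.
    from typing import List

    def tokenize(line: str) -> List[str]: ...
    def count_words(lines: List[str]) -> int: ...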

Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-29 Thread Nicholas Chammas
Dunno if I made a silly mistake, but I wanted to bring some attention to this issue in case there was something serious going on here that might affect the upcoming release. https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-25150 Nick

Re: Review notification bot

2018-07-22 Thread Nicholas Chammas
e worth mentioning nonetheless. On Sat, Jul 14, 2018 at 11:17 AM Holden Karau wrote: > Hearing no objections (and in a shout out to @ Nicholas Chammas who > initially suggested mention-bot back in 2016) I've set up a copy of mention > bot and run it against my own repo (looks like &

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-02 Thread Nicholas Chammas
ri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas > wrote: > > pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for me > > either (even building with -Phadoop-2.7). I guess I’ve been relying on an > > unsupported pattern and will need to figure something else out

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
. > > On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas > wrote: > > Building with -Phadoop-2.7 didn’t help, and if I remember correctly, > > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release, > so > > it appears something has changed since then. >

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
f you end up > mixing different versions of Hadoop. > > On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas > wrote: > > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 > using > > Flintrock. However, trying to load the hadoop-aws package gave me some &

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 using Flintrock. However, trying to load the hadoop-aws package gave me some errors. $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4 :: problems summary :: WARNINGS

Re: Documenting the various DataFrame/SQL join types

2018-05-08 Thread Nicholas Chammas
t; On Tue, May 8, 2018 at 6:13 AM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> The documentation for DataFrame.join() >> <https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join> >> lists all the join types we sup

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Nicholas Chammas
Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I certainly can, but the problem I’m facing is that of how best to track >> all the DataFrames I no longer want to persist. >> >> I create and persist various DataFrames throu

Re: eager execution and debuggability

2018-05-08 Thread Nicholas Chammas
This may be technically impractical, but it would be fantastic if we could make it easier to debug Spark programs without needing to rely on eager execution. Sprinkling .count() and .checkpoint() at various points in my code is still a debugging technique I use, but it always makes me wish Spark
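A rough sketch of that debugging pattern, with made-up paths, tables, and column names: forcing evaluation at intermediate points makes errors surface near the transformation that caused them.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # needed for checkpoint()

    events = spark.read.parquet("/data/events")              # hypothetical input
    cleaned = events.filter(F.col("user_id").isNotNull())
    cleaned.count()          # force evaluation here; bad-data errors show up now
    cleaned = cleaned.checkpoint()                           # optionally cut the lineage

    joined = cleaned.join(spark.table("users"), "user_id")   # hypothetical table
    joined.count()           # and force evaluation again after the join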

Documenting the various DataFrame/SQL join types

2018-05-08 Thread Nicholas Chammas
The documentation for DataFrame.join() lists all the join types we support: - inner - cross - outer - full - full_outer - left - left_outer - right - right_outer - left_semi
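A small illustration of passing those join types by name (toy DataFrames, values made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "x"])
    right = spark.createDataFrame([(1, "c"), (3, "d")], ["id", "y"])

    left.join(right, on="id", how="inner").show()       # only id 1
    left.join(right, on="id", how="left_outer").show()  # ids 1 and 2; y is null for 2
    left.join(right, on="id", how="left_semi").show()   # left columns only, matching ids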

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Nicholas Chammas
? ​ On Thu, May 3, 2018 at 10:26 PM Reynold Xin <r...@databricks.com> wrote: > Why do you need the underlying RDDs? Can't you just unpersist the > dataframes that you don't need? > > > On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas < > nicholas.cham...@gmail.

Identifying specific persisted DataFrames via getPersistentRDDs()

2018-04-30 Thread Nicholas Chammas
This seems to be an underexposed part of the API. My use case is this: I want to unpersist all DataFrames except a specific few. I want to do this because I know at a specific point in my pipeline that I have a handful of DataFrames that I need, and everything else is no longer needed. The
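One workaround sketch for that use case, without touching getPersistentRDDs(): keep a registry of the DataFrames you persist and unpersist everything except the ones you still need. The helper names here are made up.

    persisted = []

    def persist_tracked(df):
        """Persist a DataFrame and remember it for later cleanup."""
        persisted.append(df)
        return df.persist()

    def unpersist_all_except(keep):
        """Unpersist every tracked DataFrame not in `keep`."""
        keep_ids = {id(df) for df in keep}
        for df in persisted:
            if id(df) not in keep_ids:
                df.unpersist()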

Re: Correlated subqueries in the DataFrame API

2018-04-27 Thread Nicholas Chammas
ol from source") >> val df = table.filter($"col".isin(subQ.toSet)) >> >> That also distinguishes between a sub-query and a correlated sub-query >> that uses values from the outer query. We would still need to come up with >> syntax for the correlated case, unless

Correlated subqueries in the DataFrame API

2018-04-09 Thread Nicholas Chammas
I just submitted SPARK-23945 but wanted to double check here to make sure I didn't miss something fundamental. Correlated subqueries are tracked at a high level in SPARK-18455, but it's not

Re: Changing how we compute release hashes

2018-03-23 Thread Nicholas Chammas
To close the loop here: SPARK-23716 <https://issues.apache.org/jira/browse/SPARK-23716> On Fri, Mar 16, 2018 at 5:00 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > OK, will do. > > On Fri, Mar 16, 2018 at 4:41 PM Sean Owen <sro...@gmail.com> wrote: >

Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
t; On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I have sha512sum on my Mac via Homebrew, but yeah as long as the format >> is the same I suppose it doesn’t matter if we use shasum -a or sha512sum. >> >> So shall I fil

Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
> *To:* Felix Cheung > *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list > > *Subject:* Re: Changing how we compute release hashes > I think the issue with that is that OS X doesn't have "sha512sum". Both it > and Linux have "shasum -a 512" thou

Changing how we compute release hashes

2018-03-15 Thread Nicholas Chammas
To verify that I’ve downloaded a Hadoop release correctly, I can just do this: $ shasum --check hadoop-2.7.5.tar.gz.sha256 hadoop-2.7.5.tar.gz: OK However, since we generate Spark release hashes with GPG
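For reference, a sketch of the same check done by hand in Python, assuming the digest file uses the one-line "<hash>  <filename>" format that shasum --check expects (the file names below are illustrative):

    import hashlib

    def sha512_of(path, chunk_size=1 << 20):
        digest = hashlib.sha512()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    expected = open("spark-2.3.0-bin-hadoop2.7.tgz.sha512").read().split()[0]
    actual = sha512_of("spark-2.3.0-bin-hadoop2.7.tgz")
    print("OK" if actual == expected else "MISMATCH")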

Re: Silencing messages from Ivy when calling spark-submit

2018-03-12 Thread Nicholas Chammas
understand some settings. If you happen to figure > out the answer, please report back here. I'm sure others would find it > useful too. > > Bryan > > On Mon, Mar 5, 2018 at 3:50 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Oh, I didn't kn

Re: Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
ith "spark.jars.ivySettings" to point to your > ivysettings.xml file. Would that work for you to configure it there? > > Bryan > > On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I couldn’t get an answer any

Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
I couldn’t get an answer anywhere else, so I thought I’d ask here. Is there a way to silence the messages that come from Ivy when you call spark-submit with --packages? (For the record, I asked this question on Stack Overflow.) Would it be a good
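The workaround suggested later in the thread was to point spark.jars.ivySettings at your own ivysettings.xml. A sketch of how that might be wired up from PySpark (the path is hypothetical, and how quiet Ivy actually gets depends on what that file configures):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
        .config("spark.jars.ivySettings", "/path/to/quiet-ivysettings.xml")
        .getOrCreate()
    )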

Re: Please keep s3://spark-related-packages/ alive

2018-03-01 Thread Nicholas Chammas
Marton, Thanks for the tip. (Too bad the docs referenced from the issue I opened with INFRA make no mention of mirrors.cgi.) Matei, A Requester Pays bucket is a good idea. I was trying to avoid

Re: Please keep s3://spark-related-packages/ alive

2018-02-27 Thread Nicholas Chammas
SF projects, like Spark, FWIW. >> > > To clarify, the apache-spark.rb formula in Homebrew uses the Apache > mirror closer.lua script > > > https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb#L4 > >michael > > > >> On Mon, Feb 26

Please keep s3://spark-related-packages/ alive

2018-02-26 Thread Nicholas Chammas
If you go to the Downloads page and download Spark 2.2.1, you’ll get a link to an Apache mirror. It didn’t use to be this way. As recently as Spark 2.2.0, downloads were served via CloudFront, which was backed by an S3

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Nicholas Chammas
Launched a test cluster on EC2 with Flintrock and ran some simple tests. Building Spark took much longer than usual, but that may just be a fluke. Otherwise, all looks good to me. +1 On Fri, Feb 23, 2018 at 10:55 AM Denny Lee wrote:

Re: Kubernetes: why use init containers?

2018-01-09 Thread Nicholas Chammas
I’d like to point out the output of “git show --stat” for that diff: 29 files changed, 130 insertions(+), 1560 deletions(-) +1 for that and generally for the idea of leveraging spark-submit. You can argue that executors downloading from external servers would be faster than downloading from the

Re: Disabling Closed -> Reopened transition for non-committers

2017-10-05 Thread Nicholas Chammas
Whoops, didn’t mean to send that out to the list. Apologies. Somehow, an earlier draft of my email got sent out. Nick On Thu, Oct 5, 2017 at 9:20 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > The first sign that that conversation was going to go downhill was when > the us

Re: Run a specific PySpark test or group of tests

2017-08-16 Thread Nicholas Chammas
om> wrote: This generally works for me to just run tests within a class or even a > single test. Not as flexible as pytest -k, which would be nice.. > > $ SPARK_TESTING=1 bin/pyspark pyspark.sql.tests ArrowTests > On Tue, Aug 15, 2017 at 5:49 AM, Nicholas Chammas < > nicholas.cham...

Re: Run a specific PySpark test or group of tests

2017-08-15 Thread Nicholas Chammas
if > I understood correctly. > > > 2017-08-15 3:06 GMT+09:00 Nicholas Chammas <nicholas.cham...@gmail.com>: > >> Say you’re working on something and you want to rerun the PySpark tests, >> focusing on a specific test or group of tests. Is there a way to do that? >> >&g

Run a specific PySpark test or group of tests

2017-08-14 Thread Nicholas Chammas
Say you’re working on something and you want to rerun the PySpark tests, focusing on a specific test or group of tests. Is there a way to do that? I know that you can test entire modules with this: ./python/run-tests --modules pyspark-sql But I’m looking for something more granular, like

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-20 Thread Nicholas Chammas
Steve, I think you're a good person to ask about this. Is the below any cause for concern? Or did I perhaps test this incorrectly? Nick On Tue, Apr 18, 2017 at 11:50 PM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I had trouble starting up a shell with the AWS packa

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-18 Thread Nicholas Chammas
I had trouble starting up a shell with the AWS package loaded (specifically, org.apache.hadoop:hadoop-aws:2.7.3): [NOT FOUND ] com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms) local-m2-cache: tried

Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Nicholas Chammas
ty may force developers to more stable kind of API > / platforms & roadmaps. > > > > Thanks Enzo > > On 13 Mar 2017, at 22:09, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > Your question is answered here under "Will GraphFrames be part of

Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Nicholas Chammas
Your question is answered here under "Will GraphFrames be part of Apache Spark?", no? http://graphframes.github.io/#what-are-graphframes Nick On Mon, Mar 13, 2017 at 4:56 PM enzo wrote: > Please see this email trail: no answer so far on the user@spark board.

Will .count() always trigger an evaluation of each row?

2017-02-17 Thread Nicholas Chammas
Especially during development, people often use .count() or .persist().count() to force evaluation of all rows — exposing any problems, e.g. due to bad data — and to load data into cache to speed up subsequent operations. But as the optimizer gets smarter, I’m guessing it will eventually learn
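If count() ever does get optimized into something that skips work, a more literal way to touch every row is a no-op foreach on a persisted DataFrame; this is just a sketch of the idea, not something proposed in the thread.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100000)    # stand-in for a real dataset

    df.persist()
    df.foreach(lambda _: None)  # computes every partition, filling the cache as it goes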

Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-15 Thread Nicholas Chammas
I don't think this is the right place for questions about Databricks. I'm pretty sure they have their own website with a forum for questions about their product. Maybe this? https://forums.databricks.com/ On Wed, Feb 15, 2017 at 2:34 PM Sam Elamin wrote: > Hey folks >

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Nicholas Chammas
Congratulations, Takuya!  On Mon, Feb 13, 2017 at 2:34 PM Felix Cheung wrote: > Congratulations! > > > -- > *From:* Xuefu Zhang > *Sent:* Monday, February 13, 2017 11:29:12 AM > *To:* Xiao Li > *Cc:* Holden Karau; Reynold

Re: Typo on spark.apache.org? "cyclic data flow"

2017-01-28 Thread Nicholas Chammas
Aye aye, cap'n. PR incoming. On Sat, Jan 28, 2017 at 2:44 PM Sean Owen <so...@cloudera.com> wrote: > Certainly a typo -- feel free to make a PR for the spark-website repo. > (Might search for other instances of 'cyclic' too) > > On Sat, Jan 28, 2017 at 7:18 P

Typo on spark.apache.org? "cyclic data flow"

2017-01-28 Thread Nicholas Chammas
The tagline on http://spark.apache.org/ says: "Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing." Isn't that supposed to be "acyclic" rather than "cyclic"? What does it mean to support cyclic data flow anyway? Nick

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Nicholas Chammas
  Congratulations, Burak and Holden. On Tue, Jan 24, 2017 at 1:27 PM Russell Spitzer wrote: > Great news! Congratulations! > > On Tue, Jan 24, 2017 at 10:25 AM Dean Wampler > wrote: > > Congratulations to both of you! > > dean > > *Dean

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-19 Thread Nicholas Chammas
Since it’s not a regression from 2.0 (I believe the same issue affects both 2.0 and 2.1) it doesn’t merit a -1 vote according to the voting guidelines. Of course, it would be nice if we could fix the various optimizer issues that all seem to have a workaround that involves persist() (another one

Re: Reduce memory usage of UnsafeInMemorySorter

2016-12-07 Thread Nicholas Chammas
ub.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java#L156 > > Regards, > Kazuaki Ishizaki > > > > From:Reynold Xin <r...@databricks.com> > To:Nicholas Chammas <nicholas.cham...@gma

Reduce memory usage of UnsafeInMemorySorter

2016-12-06 Thread Nicholas Chammas
this, and the question is about internals like UnsafeInMemorySorter, I hope this is OK here. Nick On Mon, Dec 5, 2016 at 9:11 AM Nicholas Chammas nicholas.cham...@gmail.com <http://mailto:nicholas.cham...@gmail.com> wrote: I was testing out a new project at scale on Spark 2.0.2 running on YARN, &g

Re: Difference between netty and netty-all

2016-12-05 Thread Nicholas Chammas
You mean just for branch-2.0, right? ​ On Mon, Dec 5, 2016 at 8:35 PM Shixiong(Ryan) Zhu <shixi...@databricks.com> wrote: > Hey Nick, > > It should be safe to upgrade Netty to the latest 4.0.x version. Could you > submit a PR, please? > > On Mon, Dec 5, 2016 at 11

Re: Difference between netty and netty-all

2016-12-05 Thread Nicholas Chammas
y/util/internal/ThreadLocalRandom.class > > On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > > I’m looking at the list of dependencies here: > > > https://github.com/apache/spark/search?l=Groff=netty=Code=%E2%9C%93 > &

Difference between netty and netty-all

2016-12-05 Thread Nicholas Chammas
I’m looking at the list of dependencies here: https://github.com/apache/spark/search?l=Groff=netty=Code=%E2%9C%93 What’s the difference between netty and netty-all? The reason I ask is because I’m looking at a Netty PR and trying to figure out if Spark

java.lang.IllegalStateException: There is no space for new record

2016-12-05 Thread Nicholas Chammas
I was testing out a new project at scale on Spark 2.0.2 running on YARN, and my job failed with an interesting error message: TaskSetManager: Lost task 37.3 in stage 31.0 (TID 10684, server.host.name): java.lang.IllegalStateException: There is no space for new record 05:27:09.573 at

Re: Future of the Python 2 support.

2016-12-04 Thread Nicholas Chammas
I don't think it makes sense to deprecate or drop support for Python 2.7 until at least 2020, when 2.7 itself will be EOLed. (As of Spark 2.0, Python 2.6 support is deprecated and will be removed by Spark 2.2. Python 2.7 is the only version of Python 2 that's still fully supported.) Given the

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Nicholas Chammas
> -1 (non binding) https://issues.apache.org/jira/browse/SPARK-16589 No matter how useless in practice this shouldn't go to another major release. I agree that that issue is a major one since it relates to correctness, but since it's not a regression it technically does not merit a -1 vote on the

Re: Memory leak warnings in Spark 2.0.1

2016-11-23 Thread Nicholas Chammas
 Thanks for the reference and PR. On Wed, Nov 23, 2016 at 2:59 AM Reynold Xin <r...@databricks.com> wrote: > See https://issues.apache.org/jira/browse/SPARK-18557 > <https://issues.apache.org/jira/browse/SPARK-18557> > > On Mon, Nov 21, 2016 at 1:16 PM, Nicholas C

Re: Memory leak warnings in Spark 2.0.1

2016-11-21 Thread Nicholas Chammas
I'm also curious about this. Is there something we can do to help troubleshoot these leaks and file useful bug reports? On Wed, Oct 12, 2016 at 4:33 PM vonnagy wrote: > I am getting excessive memory leak warnings when running multiple mapping > and > aggregations and using

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-18495 On Thu, Nov 17, 2016 at 12:23 PM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Nice catch Suhas, and thanks for the reference. Sounds like we need a > tweak to the UI so this little feature is self-documenting. > >

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
ry instead of from HDFS." > > from > https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html > > On Thu, Nov 17, 2016 at 9:19 AM, Reynold Xin <r...@databricks.com> wrote: > > Ha funny. Never noticed that. > > > On

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
Hmm... somehow the image didn't show up. How about now? [image: Screen Shot 2016-11-17 at 11.57.14 AM.png] On Thu, Nov 17, 2016 at 12:14 PM Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: Should I be able to see something? On Thu, Nov 17, 2016 at 9:10 AM, Ni

Green dot in web UI DAG visualization

2016-11-17 Thread Nicholas Chammas
Some questions about this DAG visualization: [image: Screen Shot 2016-11-17 at 11.57.14 AM.png] 1. What's the meaning of the green dot? 2. Should this be documented anywhere (if it isn't already)? Preferably a tooltip or something directly in the UI would explain the significance. Nick

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Nicholas Chammas
Has the release already been made? I didn't see any announcement, but Homebrew has already updated to 2.0.2. On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote: > The vote has passed with the following +1s and no -1. I will work on > packaging the release. > > +1: > > Reynold Xin*

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
SPARK-18367 <https://issues.apache.org/jira/browse/SPARK-18367>: limit() makes the lame walk again On Tue, Nov 8, 2016 at 5:00 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > Hmm, it doesn’t seem like I can access the output of > df._jdf.queryExecution().hiveResultSt

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
comparison simply by > doing replaceAll("#\\d+", "#x") > > similar to the patch here: > https://github.com/apache/spark/commit/fd90541c35af2bccf0155467bec8cea7c8865046#diff-432455394ca50800d5de508861984ca5R217 > > > > On Tue, Nov 8, 2016 at 1:42 PM, Nicholas

Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
I’m trying to understand what I think is an optimizer bug. To do that, I’d like to compare the execution plans for a certain query with and without a certain change, to understand how that change is impacting the plan. How would I do that in PySpark? I’m working with 2.0.1, but I can use master
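A sketch of one way to do that from PySpark, normalizing the expression IDs (#123) as suggested later in the thread; _jdf is a private accessor rather than a stable API, and the two DataFrames here are hypothetical.

    import difflib
    import re

    def normalized_plan(df):
        plan = df._jdf.queryExecution().toString()
        return re.sub(r"#\d+", "#x", plan)

    before = normalized_plan(df_without_change)
    after = normalized_plan(df_with_change)
    print("\n".join(difflib.unified_diff(before.splitlines(), after.splitlines())))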

Using mention-bot to automatically ping potential reviewers

2016-11-06 Thread Nicholas Chammas
Howdy folks, I wonder if anybody has ever used Facebook's mention-bot in a project: https://github.com/facebook/mention-bot Seems like a useful tool to help address the problem of figuring out who to ping for review. If you've used it, what was your experience? Do you think it would be helpful

Re: Handling questions in the mailing lists

2016-11-02 Thread Nicholas Chammas
We’ve discussed several times upgrading our communication tools, as far back as 2014 and maybe even before that too. The bottom line is that we can’t due to ASF rules requiring the use of ASF-managed mailing lists. For some history, see this discussion: -

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Nicholas Chammas
blic. > That's harder to do with a language version deprecation since using such a > version doesn't really give you the same kind of repeated warnings that > using a deprecated API does. > > On Tue, Oct 25, 2016 at 12:59 PM, Nicholas Chammas < > nicholas.cham...@gm

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Nicholas Chammas
rstanding, the first steps toward removing support for Scala > 2.10 and/or Java 7 would be to deprecate them in 2.1.0. Actual removal of > support could then occur at the earliest in 2.2.0. > > On Tue, Oct 25, 2016 at 12:13 PM, Nicholas Chammas < > nicholas.cham...@gmail.com>

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Nicholas Chammas
FYI: Support for both Python 2.6 and Java 7 was deprecated in 2.0 (see release notes under Deprecations). The deprecation notice didn't offer a specific timeline for completely dropping support other than to say they "might be removed in

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
On Sun, Oct 9, 2016 at 5:19 PM Cody Koeninger wrote: > Regarding name, if the SIP overlap is a concern, we can pick a different > name. > > My tongue in cheek suggestion would be > > Spark Lightweight Improvement process (SPARKLI) > If others share my minor concern about the

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
ifferent from what Cody > had in mind, I think. > > > Matei > > On Oct 9, 2016, at 1:25 PM, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > >- Rejected strategies: I personally wouldn’t put this, because what’s >the point of voting to

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
- Rejected strategies: I personally wouldn’t put this, because what’s the point of voting to reject a strategy before you’ve really begun designing and implementing something? What if you discover that the strategy is actually better when you start doing stuff? I would guess the point

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
very least a comment from them saying yes/no/later. On Fri, Oct 7, 2016 at 5:59 PM Cody Koeninger <c...@koeninger.org> wrote: > I really like the idea of using jira votes (and/or watchers?) as a filter! > > On Fri, Oct 7, 2016 at 4:41 PM, Nicholas Chammas > <nicholas.cham..

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
I agree with Cody and others that we need some automation — or at least an adjusted process — to help us manage organic contributions better. The objections about automated closing being potentially abrasive are understood, but I wouldn’t accept that as a defeat for automation. Instead, it seems

Re: Spark Improvement Proposals

2016-10-07 Thread Nicholas Chammas
There are several important discussions happening simultaneously. Should we perhaps split them up into separate threads? Otherwise it’s really difficult to follow. It seems like the discussion about having a more formal “Spark Improvement Proposal” process should take priority here. Other

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Nicholas Chammas
> I personally find it disappointing that a big chuck of Spark's design and development is happening behind closed curtains. I'm not too familiar with Streaming, but I see design docs and proposals for ML and SQL published here and on JIRA all the time, and they are discussed extensively. For

Re: Inconsistency for nullvalue handling CSV: see SPARK-16462, SPARK-16460, SPARK-15144, SPARK-17290 and SPARK-16903

2016-08-29 Thread Nicholas Chammas
I wish JIRA would automatically show you potentially similar issues as you are typing up a new one, like Stack Overflow does... It would really help cut down on duplicate reports. On Mon, Aug 29, 2016 at 10:55 PM Hyukjin Kwon wrote: > Hi all, > > > PR: >

Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nicholas Chammas
orical > features) into ints (0-based indexes). It could (should) accept multiple > input columns for efficiency (see > https://issues.apache.org/jira/browse/SPARK-11215). This is a case where > multiple output columns would be required. > > N > > > On Tue, 23 Aug 201

Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nicholas Chammas
If you create your own Spark 2.x ML Transformer, there are multiple mix-ins (is that the correct term?) that you can use to define its behavior which are in ml/param/shared.py . Among them are the following mix-ins:
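For context, a rough sketch of a custom Transformer built from those mix-ins (the details of the keyword_only plumbing vary slightly across 2.x versions):

    from pyspark import keyword_only
    from pyspark.ml import Transformer
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol
    from pyspark.sql import functions as F

    class Doubler(Transformer, HasInputCol, HasOutputCol):
        """Writes inputCol * 2 into outputCol."""

        @keyword_only
        def __init__(self, inputCol=None, outputCol=None):
            super(Doubler, self).__init__()
            self._set(**self._input_kwargs)

        def _transform(self, dataset):
            return dataset.withColumn(
                self.getOutputCol(), F.col(self.getInputCol()) * 2)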

Re: Persisting PySpark ML Pipelines that include custom Transformers

2016-08-19 Thread Nicholas Chammas
yourself into this approach - > in either case much of the persistence work is up to you it's just a matter > if you do it in the JVM or Python. > > On Friday, August 19, 2016, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > >> I understand persistence for PySpa

Persisting PySpark ML Pipelines that include custom Transformers

2016-08-19 Thread Nicholas Chammas
I understand persistence for PySpark ML pipelines is already present in 2.0, and further improvements are being made for 2.1 (e.g. SPARK-13786 ). I’m having trouble, though, persisting a pipeline that includes a custom Transformer (see

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nicholas Chammas
nd send me feedback or create issues at that github location. > > On Aug 11, 2016, at 7:42 AM, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > Thanks Michael for the reference, and thanks Nick for the comprehensive > overview of existing JIRA discussions about t

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nicholas Chammas
on, but >> we use it in production to serve a random forest model trained by a Spark >> ML pipeline. >> >> Thanks, >> >> Michael >> >> On Aug 10, 2016, at 7:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com> >> wrote: >> >&g

Serving Spark ML models via a regular Python web app

2016-08-10 Thread Nicholas Chammas
Are there any existing JIRAs covering the possibility of serving up Spark ML models via, for example, a regular Python web app? The story goes like this: You train your model with Spark on several TB of data, and now you want to use it in a prediction service that you’re building, say with Flask

Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Nicholas Chammas
 Do we now have 2 SparkR-focused committers (Shivaram + Felix)? Or are there more? Nick On Mon, Aug 8, 2016 at 2:17 PM Dongjoon Hyun wrote: > Congratulation, Felix! > > Bests, > Dongjoon. > > > On Monday, August 8, 2016, Ted Yu wrote: > >>

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Nicholas Chammas
Don't know much about Spark + Arrow efforts myself; just wanted to share the reference. On Fri, Aug 5, 2016 at 6:53 PM Jim Pivarski <jpivar...@gmail.com> wrote: > On Fri, Aug 5, 2016 at 5:14 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Relevant jira

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Nicholas Chammas
;ko...@tresata.com> wrote: > The tricky part is that the action needs to be inside the with block, not > just the transformation that uses the persisted data. > > On Aug 5, 2016 1:44 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> > wrote: > > Okie dok

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Nicholas Chammas
Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534 On Fri, Aug 5, 2016 at 5:22 PM, Holden Karau wrote: > I don't think there is an approximate timescale right now and its likely > any implementation would depend on a solid Java implementation of Arrow > being ready

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Nicholas Chammas
Okie doke, I've filed a JIRA for this here: https://issues.apache.org/jira/browse/SPARK-16921 On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin <r...@databricks.com> wrote: > Sounds like a great idea! > > On Friday, August 5, 2016, Nicholas Chammas <nicholas.cham...@gmail.com>

PySpark: Make persist() return a context manager

2016-08-04 Thread Nicholas Chammas
Context managers are a natural way to capture closely related setup and teardown code in Python. For example, they are commonly used when doing file I/O: with open('/path/to/file') as f: contents = f.read() ... Once
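A sketch of what the proposed usage could look like, approximated today with contextlib instead of changing persist() itself; as noted later in the thread, the action has to happen inside the block.

    from contextlib import contextmanager

    @contextmanager
    def persisted(df):
        df.persist()
        try:
            yield df
        finally:
            df.unpersist()

    # with persisted(some_df) as df:   # some_df is any existing DataFrame
    #     df.count()                   # actions that reuse the cached data go here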

Re: Clarifying that spark-x.x.x-bin-hadoopx.x.tgz doesn't include Hadoop itself

2016-07-29 Thread Nicholas Chammas
;without-hadoop" which did not include any Hadoop jars. > > On Fri, Jul 29, 2016 at 12:56 PM, Nicholas Chammas > <nicholas.cham...@gmail.com> wrote: > > I had an interaction on my project today that suggested some people may > be > > confused about what

Clarifying that spark-x.x.x-bin-hadoopx.x.tgz doesn't include Hadoop itself

2016-07-29 Thread Nicholas Chammas
I had an interaction on my project today that suggested some people may be confused about what the packages available on the downloads page are actually for. Specifically, the various -hadoopx.x.tgz packages suggest that

PySpark UDFs with a return type of FloatType can't handle int return values

2016-07-28 Thread Nicholas Chammas
If I define a UDF in PySpark that has a return type of FloatType, but the underlying function actually returns an int, the UDF throws the int away and returns None. It seems that some machinery inside pyspark.sql.types is perhaps unaware that it can always cast ints to floats. Is this
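A minimal sketch of the behavior being described, along with the obvious workaround of casting to float inside the UDF:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(3)

    bad = udf(lambda x: 1, FloatType())          # returns an int; comes back as null
    good = udf(lambda x: float(1), FloatType())  # explicit cast round-trips correctly

    df.select(bad("id"), good("id")).show()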

Re: renaming "minor release" to "feature release"

2016-07-28 Thread Nicholas Chammas
+1 The semantics conveyed by "feature release" are compatible with the meaning of "minor release" under strict SemVer, but as argued are clearer from a user-communication point of view. http://semver.org Nick On Thu, Jul 28, 2016 at 7:20 PM, Matei Zaharia wrote: > I also

Re: Cartesian join between DataFrames

2016-07-25 Thread Nicholas Chammas
n, Jul 25, 2016 at 6:45 PM Reynold Xin <r...@databricks.com> wrote: > DataFrame can do cartesian joins. > > > On July 25, 2016 at 3:43:19 PM, Nicholas Chammas ( > nicholas.cham...@gmail.com) wrote: > > It appears that RDDs can do a cartesian join, but not DataFrames. Is

Cartesian join between DataFrames

2016-07-25 Thread Nicholas Chammas
It appears that RDDs can do a cartesian join, but not DataFrames. Is there a fundamental reason why not, or is this just waiting for someone to implement? I know you can get the RDDs underlying the DataFrames and do the cartesian join that way, but you lose the schema of course. Nick
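For reference, a sketch of a cartesian product between DataFrames; depending on the Spark version this is either crossJoin() (2.1+) or a condition-less join with spark.sql.crossJoin.enabled set to true.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    colors = spark.createDataFrame([("red",), ("blue",)], ["color"])
    sizes = spark.createDataFrame([("S",), ("M",), ("L",)], ["size"])

    colors.crossJoin(sizes).show()   # 2 x 3 = 6 rows, schema preserved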

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Nicholas Chammas
-1 There is a typo here : There is specially handling for not-a-number (NaN)… Just kidding, of course (about the vote). :) I vote +1 (for realz). - Successfully

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Nicholas Chammas
Oh nevermind, just noticed your note. Apologies. On Thu, Jul 14, 2016 at 4:20 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > Just curious: Did we have an RC3? I don't remember seeing one. > > > On Thu, Jul 14, 2016 at 3:00 PM Reynold Xin <r...@databricks.com>

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Nicholas Chammas
Just curious: Did we have an RC3? I don't remember seeing one. On Thu, Jul 14, 2016 at 3:00 PM Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.0. The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes > if a

Re: Bad JIRA components

2016-07-07 Thread Nicholas Chammas
Thanks Reynold. On Thu, Jul 7, 2016 at 5:03 PM Reynold Xin <r...@databricks.com> wrote: > I deleted those. > > > On Thu, Jul 7, 2016 at 1:27 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> >> https://issues.apache.org/jira/browse/

Bad JIRA components

2016-07-07 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-panel There are several bad components in there, like docs, MLilb, and sq;. I’ve updated the issues that were assigned to them, but I don’t know if there is a way to delete these components

Re: Expanded docs for the various storage levels

2016-07-07 Thread Nicholas Chammas
JIRA is here: https://issues.apache.org/jira/browse/SPARK-16427 On Thu, Jul 7, 2016 at 3:18 PM Reynold Xin <r...@databricks.com> wrote: > Please create a patch. Thanks! > > > On Thu, Jul 7, 2016 at 12:07 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: &

Expanded docs for the various storage levels

2016-07-07 Thread Nicholas Chammas
I’m looking at the docs here: http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel A newcomer to Spark won’t understand the meaning of _2, or the meaning of _SER (or its value), and
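For the record, a short illustration of the suffixes in question: _2 means each partition is replicated on two nodes, and _SER means the data is stored serialized (in PySpark the data is always stored serialized regardless).

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)

    df.persist(StorageLevel.MEMORY_AND_DISK_2)   # memory first, spill to disk, 2 replicas
    print(StorageLevel.MEMORY_AND_DISK_2)        # prints the underlying flags, incl. replication = 2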

Re: Please add an unsubscribe link to the footer of user list email

2016-06-28 Thread Nicholas Chammas
led infra ticket: https://issues.apache.org/jira/browse/INFRA-12185 >> >> >> >> On Mon, Jun 27, 2016 at 10:02 AM, Reynold Xin <r...@databricks.com> >> wrote: >> >>> Let me look into this... >>> >>> >>> On Monday, June 27, 2016
