[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-09 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652406#comment-15652406 ] Nicholas Chammas commented on SPARK-18367: -- I've spent the day trying to narrow down what

[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649132#comment-15649132 ] Nicholas Chammas commented on SPARK-18367: -- On 2.0.x the caching is required due to SPARK-18254

[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649063#comment-15649063 ] Nicholas Chammas commented on SPARK-18367: -- I'm not trying to write any files actually

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
SPARK-18367 <https://issues.apache.org/jira/browse/SPARK-18367>: limit() makes the lame walk again On Tue, Nov 8, 2016 at 5:00 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > Hmm, it doesn’t seem like I can access the output of > df._jdf.queryExecution().hiveResultSt

[jira] [Updated] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18367: - Description: I have a complex DataFrame query that fails to run normally but succeeds

[jira] [Updated] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18367: - Attachment: plan-without-limit.txt plan-with-limit.txt > limit() ma

[jira] [Created] (SPARK-18367) limit() makes the lame walk again

2016-11-08 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18367: Summary: limit() makes the lame walk again Key: SPARK-18367 URL: https://issues.apache.org/jira/browse/SPARK-18367 Project: Spark Issue Type: Bug

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
comparison simply by > doing replaceAll("#\\d+", "#x") > > similar to the patch here: > https://github.com/apache/spark/commit/fd90541c35af2bccf0155467bec8cea7c8865046#diff-432455394ca50800d5de508861984ca5R217 > > > > On Tue, Nov 8, 2016 at 1:42 PM, Nicholas

Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Nicholas Chammas
I’m trying to understand what I think is an optimizer bug. To do that, I’d like to compare the execution plans for a certain query with and without a certain change, to understand how that change is impacting the plan. How would I do that in PySpark? I’m working with 2.0.1, but I can use master
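A minimal sketch of the comparison asked about here, assuming the two plans are already available as plain text (for instance saved to files, like the plan-with-limit.txt and plan-without-limit.txt attachments on SPARK-18367). It applies the `#\d+` to `#x` substitution suggested in the reply listed above so that expression IDs do not register as differences, then produces a unified diff. The function names and the `fromfile`/`tofile` labels are illustrative and not part of any Spark API.

```python
import difflib
import re


def normalize_plan(plan_text):
    """Replace expression IDs such as `#123` with `#x` so they never differ."""
    return re.sub(r"#\d+", "#x", plan_text)


def diff_plans(plan_a, plan_b):
    """Return unified-diff lines between two plan strings, ignoring expression IDs."""
    a_lines = normalize_plan(plan_a).splitlines()
    b_lines = normalize_plan(plan_b).splitlines()
    return list(
        difflib.unified_diff(
            a_lines,
            b_lines,
            fromfile="plan-without-limit",  # illustrative labels only
            tofile="plan-with-limit",
            lineterm="",
        )
    )
```

Two plans that differ only in their expression IDs yield an empty diff, so whatever lines remain point at structural changes in the plan.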

Using mention-bot to automatically ping potential reviewers

2016-11-06 Thread Nicholas Chammas
Howdy folks, I wonder if anybody has ever used Facebook's mention-bot in a project: https://github.com/facebook/mention-bot Seems like a useful tool to help address the problem of figuring out who to ping for review. If you've used it, what was your experience? Do you think it would be helpful

[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637713#comment-15637713 ] Nicholas Chammas commented on SPARK-18277: -- {quote} If you try {{when()}}, you realize that you

[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637654#comment-15637654 ] Nicholas Chammas commented on SPARK-18277: -- Thanks for the pointer. I'll follow the discussion

[jira] [Comment Edited] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637566#comment-15637566 ] Nicholas Chammas edited comment on SPARK-18277 at 11/4/16 8:25 PM

[jira] [Updated] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18277: - Description: It appears that you cannot use {{fill()}} and friends to quickly modify

[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637566#comment-15637566 ] Nicholas Chammas commented on SPARK-18277: -- [~marmbrus] / [~yhuai]: Is there a workaround

[jira] [Updated] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18277: - Description: It appears that you cannot use {{fill()}} and friends to quickly modify

[jira] [Created] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18277: Summary: na.fill() and friends should work on struct fields Key: SPARK-18277 URL: https://issues.apache.org/jira/browse/SPARK-18277 Project: Spark

[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636944#comment-15636944 ] Nicholas Chammas commented on SPARK-18128: -- For the record: Let's also check with the PyPI

[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636508#comment-15636508 ] Nicholas Chammas commented on SPARK-18128: -- [~prabinb] - See [this discussion|https

[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636497#comment-15636497 ] Nicholas Chammas commented on SPARK-18128: -- For the record: A PyPI admin is looking

[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636500#comment-15636500 ] Nicholas Chammas commented on SPARK-18128: -- [~holdenk] - Shall we make this issue a subtask

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634455#comment-15634455 ] Nicholas Chammas commented on SPARK-18254: -- So it was specifically some broken

[jira] [Comment Edited] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391 ] Nicholas Chammas edited comment on SPARK-18254 at 11/3/16 9:58 PM

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634428#comment-15634428 ] Nicholas Chammas commented on SPARK-18254: -- Just tried it. Seems like the fix is only available

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391 ] Nicholas Chammas commented on SPARK-18254: -- If I try branch-2.1

[jira] [Comment Edited] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391 ] Nicholas Chammas edited comment on SPARK-18254 at 11/3/16 9:46 PM

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633744#comment-15633744 ] Nicholas Chammas commented on SPARK-18254: -- Interestingly, if I add {{names_cleaned.persist

[jira] [Comment Edited] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633446#comment-15633446 ] Nicholas Chammas edited comment on SPARK-18254 at 11/3/16 4:57 PM: --- Yes

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633446#comment-15633446 ] Nicholas Chammas commented on SPARK-18254: -- Yes, if I don't alias the columns and/or update

[jira] [Updated] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18254: - Description: Dunno if I'm misinterpreting something here, but this seems like a bug

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633424#comment-15633424 ] Nicholas Chammas commented on SPARK-18254: -- Yep, it works fine if the column names haven't been

[jira] [Updated] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18254: - Description: Dunno if I'm misinterpreting something here, but this seems like a bug

[jira] [Updated] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18254: - Description: Dunno if I'm misinterpreting something here, but this seems like a bug

[jira] [Updated] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18254: - Description: Dunno if I'm misinterpreting something here, but this seems like a bug

[jira] [Updated] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18254: - Summary: UDFs don't see aliased column names (was: UDFs don't see aliased column names

[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names; somehow they get the original names

2016-11-03 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633220#comment-15633220 ] Nicholas Chammas commented on SPARK-18254: -- [~marmbrus] / [~hvanhovell]: Is there a workaround

[jira] [Created] (SPARK-18254) UDFs don't see aliased column names; somehow they get the original names

2016-11-03 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18254: Summary: UDFs don't see aliased column names; somehow they get the original names Key: SPARK-18254 URL: https://issues.apache.org/jira/browse/SPARK-18254

[jira] [Commented] (SPARK-16726) Improve `Union/Intersect/Except` error messages on incompatible types

2016-11-02 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630597#comment-15630597 ] Nicholas Chammas commented on SPARK-16726: -- I just hit this error in 2.0.1 and it was this JIRA

Re: Handling questions in the mailing lists

2016-11-02 Thread Nicholas Chammas
We’ve discussed several times upgrading our communication tools, as far back as 2014 and maybe even before that too. The bottom line is that we can’t due to ASF rules requiring the use of ASF-managed mailing lists. For some history, see this discussion: -

[jira] [Commented] (SPARK-14900) spark.ml classification metrics should include accuracy

2016-10-29 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618637#comment-15618637 ] Nicholas Chammas commented on SPARK-14900: -- I don't know if this belongs in a separate issue

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Nicholas Chammas
blic. > That's harder to do with a language version deprecation since using such a > version doesn't really give you the same kind of repeated warnings that > using a deprecated API does. > > On Tue, Oct 25, 2016 at 12:59 PM, Nicholas Chammas < > nicholas.cham...@gm

[jira] [Commented] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-25 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606484#comment-15606484 ] Nicholas Chammas commented on SPARK-18084: -- cc [~marmbrus] - Dunno if this is actually bug

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Nicholas Chammas
rstanding, the first steps toward removing support for Scala > 2.10 and/or Java 7 would be to deprecate them in 2.1.0. Actual removal of > support could then occur at the earliest in 2.2.0. > > On Tue, Oct 25, 2016 at 12:13 PM, Nicholas Chammas < > nicholas.cham...@gmail.com>

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Nicholas Chammas
FYI: Support for both Python 2.6 and Java 7 was deprecated in 2.0 (see release notes under Deprecations). The deprecation notice didn't offer a specific timeline for completely dropping support other than to say they "might be removed in

[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-24 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18084: - Issue Type: Bug (was: Improvement) > write.partitionBy() does not recognize nes

[jira] [Created] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-24 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18084: Summary: write.partitionBy() does not recognize nested columns that select() can access Key: SPARK-18084 URL: https://issues.apache.org/jira/browse/SPARK-18084

[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-10-24 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603211#comment-15603211 ] Nicholas Chammas commented on SPARK-12757: -- Just to link back, [~josephkb] is reporting

[jira] [Closed] (SPARK-17976) Global options to spark-submit should not be position-sensitive

2016-10-17 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-17976. Resolution: Not A Problem Ah, makes perfect sense. Would have realized that myself if I

[jira] [Created] (SPARK-17976) Global options to spark-submit should not be position-sensitive

2016-10-17 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-17976: Summary: Global options to spark-submit should not be position-sensitive Key: SPARK-17976 URL: https://issues.apache.org/jira/browse/SPARK-17976 Project

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
On Sun, Oct 9, 2016 at 5:19 PM Cody Koeninger wrote: > Regarding name, if the SIP overlap is a concern, we can pick a different > name. > > My tongue in cheek suggestion would be > > Spark Lightweight Improvement process (SPARKLI) > If others share my minor concern about the

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
ifferent from what Cody > had in mind, I think. > > > Matei > > On Oct 9, 2016, at 1:25 PM, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > >- Rejected strategies: I personally wouldn’t put this, because what’s >the point of voting to

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
- Rejected strategies: I personally wouldn’t put this, because what’s the point of voting to reject a strategy before you’ve really begun designing and implementing something? What if you discover that the strategy is actually better when you start doing stuff? I would guess the point

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
very least a comment from them saying yes/no/later. On Fri, Oct 7, 2016 at 5:59 PM Cody Koeninger <c...@koeninger.org> wrote: > I really like the idea of using jira votes (and/or watchers?) as a filter! > > On Fri, Oct 7, 2016 at 4:41 PM, Nicholas Chammas > <nicholas.cham..

Re: Improving volunteer management / JIRAs (split from Spark Improvement Proposals thread)

2016-10-07 Thread Nicholas Chammas
I agree with Cody and others that we need some automation — or at least an adjusted process — to help us manage organic contributions better. The objections about automated closing being potentially abrasive are understood, but I wouldn’t accept that as a defeat for automation. Instead, it seems

Re: Spark Improvement Proposals

2016-10-07 Thread Nicholas Chammas
There are several important discussions happening simultaneously. Should we perhaps split them up into separate threads? Otherwise it’s really difficult to follow. It seems like the discussion about having a more formal “Spark Improvement Proposal” process should take priority here. Other

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
s.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any mo

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
age or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 2 September 201

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh wrote: > I believe as we progress in time Spark is going to move away from Python. If > you look at 2014 Databricks code examples, they were mostly in Python. Now > they are mostly in Scala for a reason. > That's complete

[jira] [Comment Edited] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-08-31 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452150#comment-15452150 ] Nicholas Chammas edited comment on SPARK-14742 at 8/31/16 12:50 PM

[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-08-31 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452150#comment-15452150 ] Nicholas Chammas commented on SPARK-14742: -- Sounds good to me. > Redirect spark-ec2 doc to

[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-08-30 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450626#comment-15450626 ] Nicholas Chammas commented on SPARK-14742: -- {quote} Otherwise the only way to get to this link

[jira] [Commented] (SPARK-14742) Redirect spark-ec2 doc to new location

2016-08-30 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450602#comment-15450602 ] Nicholas Chammas commented on SPARK-14742: -- http://spark.apache.org/docs/latest/ec2-scripts.html

Re: Structured Streaming with Kafka sources/sinks

2016-08-30 Thread Nicholas Chammas
> I personally find it disappointing that a big chunk of Spark's design and development is happening behind closed curtains. I'm not too familiar with Streaming, but I see design docs and proposals for ML and SQL published here and on JIRA all the time, and they are discussed extensively. For

Re: Inconsistency for nullvalue handling CSV: see SPARK-16462, SPARK-16460, SPARK-15144, SPARK-17290 and SPARK-16903

2016-08-29 Thread Nicholas Chammas
I wish JIRA would automatically show you potentially similar issues as you are typing up a new one, like Stack Overflow does... It would really help cut down on duplicate reports. On Mon, Aug 29, 2016 at 10:55 PM Hyukjin Kwon wrote: > Hi all, > > > PR: >

[jira] [Updated] (SPARK-17220) Upgrade Py4J to 0.10.3

2016-08-26 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-17220: - Component/s: PySpark > Upgrade Py4J to 0.1

[jira] [Commented] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

2016-08-25 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437616#comment-15437616 ] Nicholas Chammas commented on SPARK-14241: -- [~marmbrus] - Would it be tough to make

Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nicholas Chammas
orical > features) into ints (0-based indexes). It could (should) accept multiple > input columns for efficiency (see > https://issues.apache.org/jira/browse/SPARK-11215). This is a case where > multiple output columns would be required. > > N > > > On Tue, 23 Aug 201

Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nicholas Chammas
If you create your own Spark 2.x ML Transformer, there are multiple mix-ins (is that the correct term?) that you can use to define its behavior which are in ml/param/shared.py. Among them are the following mix-ins:

[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-19 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428988#comment-15428988 ] Nicholas Chammas commented on SPARK-17025: -- {quote} We'd need to figure out a good design

Re: Persisting PySpark ML Pipelines that include custom Transformers

2016-08-19 Thread Nicholas Chammas
yourself into this approach - > in either case much of the persistence work is up to you it's just a matter > if you do it in the JVM or Python. > > On Friday, August 19, 2016, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > >> I understand persistence for PySpa

Persisting PySpark ML Pipelines that include custom Transformers

2016-08-19 Thread Nicholas Chammas
I understand persistence for PySpark ML pipelines is already present in 2.0, and further improvements are being made for 2.1 (e.g. SPARK-13786). I’m having trouble, though, persisting a pipeline that includes a custom Transformer (see

[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:33 PM: --- cc

[jira] [Comment Edited] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas edited comment on SPARK-17025 at 8/11/16 7:27 PM: --- cc

[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417788#comment-15417788 ] Nicholas Chammas commented on SPARK-17025: -- cc [~josephkb] [~mengxr] > Cannot persist PySp

[jira] [Created] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-17025: Summary: Cannot persist PySpark ML Pipeline model that includes custom Transformer Key: SPARK-17025 URL: https://issues.apache.org/jira/browse/SPARK-17025

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nicholas Chammas
nd send me feedback or create issues at that github location. > > On Aug 11, 2016, at 7:42 AM, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > Thanks Michael for the reference, and thanks Nick for the comprehensive > overview of existing JIRA discussions about t

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nicholas Chammas
on, but >> we use it in production to serve a random forest model trained by a Spark >> ML pipeline. >> >> Thanks, >> >> Michael >> >> On Aug 10, 2016, at 7:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com> >> wrote: >> >&g

Serving Spark ML models via a regular Python web app

2016-08-10 Thread Nicholas Chammas
Are there any existing JIRAs covering the possibility of serving up Spark ML models via, for example, a regular Python web app? The story goes like this: You train your model with Spark on several TB of data, and now you want to use it in a prediction service that you’re building, say with Flask

Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Tue, Aug 9, 2016 at 5:14 PM abhishek singh wrote: > >

Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Tue, Aug 9, 2016 at 8:03 PM James Ding wrote: > >

Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Wed, Aug 10, 2016 at 2:46 AM Martin Somers wrote: > > > -- > M >

Re: Unsubscribe

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Tue, Aug 9, 2016 at 3:02 PM Hogancamp, Aaron < aaron.t.hoganc...@leidos.com> wrote: > Unsubscribe. > > > > Thanks, > > > > Aaron Hogancamp > > Data Scientist > > >

Re: Unsubscribe.

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe: http://spark.apache.org/community.html On Tue, Aug 9, 2016 at 3:05 PM Martin Somers wrote: > Unsubscribe. > > Thanks > M >

[jira] [Commented] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers

2016-08-09 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414067#comment-15414067 ] Nicholas Chammas commented on SPARK-16921: -- [~holdenk] - Probably won't be able to do it myself

Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Nicholas Chammas
Do we now have 2 SparkR-focused committers (Shivaram + Felix)? Or are there more? Nick On Mon, Aug 8, 2016 at 2:17 PM Dongjoon Hyun wrote: > Congratulations, Felix! > > Bests, > Dongjoon. > > > On Monday, August 8, 2016, Ted Yu wrote: > >>

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Nicholas Chammas
Don't know much about Spark + Arrow efforts myself; just wanted to share the reference. On Fri, Aug 5, 2016 at 6:53 PM Jim Pivarski <jpivar...@gmail.com> wrote: > On Fri, Aug 5, 2016 at 5:14 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Relevant jira

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Nicholas Chammas
;ko...@tresata.com> wrote: > The tricky part is that the action needs to be inside the with block, not > just the transformation that uses the persisted data. > > On Aug 5, 2016 1:44 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> > wrote: > > Okie dok

Re: Apache Arrow data in buffer to RDD/DataFrame/Dataset?

2016-08-05 Thread Nicholas Chammas
Relevant jira: https://issues.apache.org/jira/browse/SPARK-13534 On Fri, Aug 5, 2016 at 5:22 PM, Holden Karau wrote: > I don't think there is an approximate timescale right now and it's likely > any implementation would depend on a solid Java implementation of Arrow > being ready

Re: PySpark: Make persist() return a context manager

2016-08-05 Thread Nicholas Chammas
Okie doke, I've filed a JIRA for this here: https://issues.apache.org/jira/browse/SPARK-16921 On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin <r...@databricks.com> wrote: > Sounds like a great idea! > > On Friday, August 5, 2016, Nicholas Chammas <nicholas.cham...@gmail.com>

[jira] [Created] (SPARK-16921) RDD/DataFrame persist() and cache() should return Python context managers

2016-08-05 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16921: Summary: RDD/DataFrame persist() and cache() should return Python context managers Key: SPARK-16921 URL: https://issues.apache.org/jira/browse/SPARK-16921

[jira] [Closed] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.

2016-08-05 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas closed SPARK-7505. --- Resolution: Invalid Closing this as invalid as I believe these issues are no longer important

[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2016-08-05 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409767#comment-15409767 ] Nicholas Chammas commented on SPARK-5312: - [~boyork] - Shall we close this? It doesn't look like

Re: Add column sum as new column in PySpark dataframe

2016-08-05 Thread Nicholas Chammas
I think this is what you need: import pyspark.sql.functions as sqlf df.withColumn('total', sqlf.sum(df.columns)) Nic On Thu, Aug 4, 2016 at 9:41 AM Javier Rey jre...@gmail.com wrote: Hi everybody, > > Sorry, the last message I sent was incomplete; this is
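A hedged note on the snippet above: as far as I understand it, `pyspark.sql.functions.sum` is an aggregate that totals one column across rows, so it may not give a per-row total across columns. A row-wise total can instead be sketched by adding the column expressions together with `+`. The helper name `row_total` is hypothetical, and the sketch only assumes that `df.columns` lists the column names and that `df[name]` returns something supporting `+`, which PySpark `Column` objects do.

```python
from functools import reduce
from operator import add


def row_total(df):
    """Build one expression that adds every column of `df` together, row by row.

    Duck-typed on purpose: it only needs `df.columns` and `df[name]`.
    """
    return reduce(add, (df[name] for name in df.columns))


# Assumed usage with a real PySpark DataFrame (not executed here):
#   df = df.withColumn('total', row_total(df))
```

With real data, a null in any column would make that row's total null, so a fill step beforehand may be needed.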

PySpark: Make persist() return a context manager

2016-08-04 Thread Nicholas Chammas
Context managers are a natural way to capture closely related setup and teardown code in Python. For example, they are commonly used when doing file I/O: with open('/path/to/file') as f: contents = f.read() ... Once
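The idea proposed in this thread can be sketched with `contextlib`, without changing Spark itself. This is only an illustration under stated assumptions: `persisted` is a hypothetical helper, not an existing PySpark function, and `FakeDataset` is a stand-in that exists solely so the sketch runs without a cluster. It relies only on the `persist()` and `unpersist()` methods that RDDs and DataFrames provide.

```python
from contextlib import contextmanager


@contextmanager
def persisted(dataset):
    """Persist `dataset` for the duration of the with block, then unpersist it."""
    dataset.persist()
    try:
        yield dataset
    finally:
        # Runs even if the body raises, mirroring how open() closes its file.
        dataset.unpersist()


class FakeDataset(object):
    """Hypothetical stand-in for an RDD/DataFrame, used only for demonstration."""

    def __init__(self):
        self.is_persisted = False

    def persist(self):
        self.is_persisted = True
        return self

    def unpersist(self):
        self.is_persisted = False
        return self


with persisted(FakeDataset()) as cached:
    # Any action that should benefit from the cache has to run in here.
    assert cached.is_persisted
```

As pointed out in a reply listed above, the action has to sit inside the `with` block, not just the transformation; otherwise the data is unpersisted before anything is computed.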

Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
curious what to > use instead. > > On Aug 4, 2016, at 3:54 PM, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > Have you looked at pyspark.sql.functions.udf and the associated examples? > On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen <bteeu...@gmail.com> wrote: > &

Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
Have you looked at pyspark.sql.functions.udf and the associated examples? On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote: > Hi, > > I’d like to use a UDF in pyspark 2.0. As in .. > > > def squareIt(x): > return x * x > > # register the function and define return type >

[jira] [Comment Edited] (SPARK-7146) Should ML sharedParams be a public API?

2016-08-02 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300 ] Nicholas Chammas edited comment on SPARK-7146 at 8/3/16 4:45 AM: - A quick

[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?

2016-08-02 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300 ] Nicholas Chammas commented on SPARK-7146: - A quick update from a PySpark user: I am using

[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-08-01 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402515#comment-15402515 ] Nicholas Chammas commented on SPARK-16782: -- Poking around a bit more, it seems like a possible

[jira] [Commented] (SPARK-16782) Use Sphinx autodoc to eliminate duplication of Python docstrings

2016-08-01 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-16782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402477#comment-15402477 ] Nicholas Chammas commented on SPARK-16782: -- Hmm never mind. I think I've misunderstood
