[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584239#comment-16584239 ] Nicholas Chammas edited comment on SPARK-25150 at 8/17/18 6:15 PM: --- I

[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584239#comment-16584239 ] Nicholas Chammas commented on SPARK-25150: -- I know there are a bunch of pending bug fixes

[jira] [Updated] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-25150: - Attachment: zombie-analysis.py states.csv persons.csv

[jira] [Created] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-17 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-25150: Summary: Joining DataFrames derived from the same source yields confusing/incorrect results Key: SPARK-25150 URL: https://issues.apache.org/jira/browse/SPARK-25150

Re: [Python-ideas] Toxic forum

2018-08-13 Thread Nicholas Chammas
From what I’ve seen on this list in my relatively brief time here, this forum is mostly fine and the participants generally behave like adults. I don’t read every thread, so maybe I don’t have an accurate picture. From what I’ve seen, there is the occasional spat, where people just need to step

Re: [Python-ideas] File format for automatic and manual tests

2018-08-09 Thread Nicholas Chammas
On Wed, Aug 8, 2018 at 5:09 AM Paul Moore wrote: > This strikes me as *absolutely* something that should be promoted > outside of the stdlib, as a 3rd party project, and once it's > established as a commonly used and accepted standard, only then > propose that the stdlib offer support for it (if

Re: [Python-ideas] Revisiting dedicated overloadable boolean operators

2018-08-03 Thread Nicholas Chammas
On Fri, Aug 3, 2018 at 1:47 PM Todd toddr...@gmail.com wrote: The operators would be: > > bNOT - boolean "not" > bAND - boolean "and" > bOR - boolean "or" > bXOR - boolean "xor" > These look pretty ugly to me. But that could just be a matter of familiarity. For

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-29 Thread Nicholas Chammas
On Sun, Jul 29, 2018 at 10:58 AM Steven D'Aprano wrote: > On Sun, Jul 29, 2018 at 06:32:19AM -0400, David Mertz wrote: > > On Sun, Jul 29, 2018, 2:00 AM Steven D'Aprano > wrote: > > > > > Fine. So it takes them an extra day to learn one more operator. Big > > > deal. It is commonly believed to

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Thu, Jul 26, 2018 at 12:17 AM David Mertz wrote: > On Thu, Jul 26, 2018 at 12:00 AM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Forgive me for being slow. I'm missing what's different in semantics >> between the translation above and Chris's tr

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 11:09 PM David Mertz wrote: > On Wed, Jul 25, 2018 at 10:50 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Indeed. Thanks for the counter-example. I think the correct translation >> is as follows: >> food = spam?.eg

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 10:12 PM David Mertz wrote: > On Wed, Jul 25, 2018 at 9:47 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> > That is disingenuous, I think. Can this raise an AttributeError? >>> > spam?.eggs?.bacon >>

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 9:20 PM Chris Angelico wrote: > On Thu, Jul 26, 2018 at 11:02 AM, David Mertz wrote: > > That is disingenuous, I think. Can this raise an AttributeError? > > > > spam?.eggs?.bacon > > > > Of course it can! And this is exactly the pattern used in many examples > in >

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 6:11 PM Abe Dillon wrote: > The problem here is not whether it's explicit. It's about Readability and > conciseness. Using symbols in place of words almost always harms > readability in favor of conciseness. > > value = person.name if person.name else person > > almost

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
25, 2018, 5:17 PM Nicholas Chammas > wrote: > >> On Mon, Jul 23, 2018 at 6:05 PM Giampaolo Rodola' >> wrote: >> >>> This: >>> >>> v = a?.b >>> >>> ...*implicitly* checks if value is not None [and continues exe

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Wed, Jul 25, 2018 at 12:12 PM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > On Mon, Jul 23, 2018 at 6:05 PM Giampaolo Rodola' > wrote: > >> This: >> >> v = a?.b >> >> ...*implicitly* checks if value is not None [and continues execu

Re: [Python-ideas] PEP 505: None-aware operators

2018-07-25 Thread Nicholas Chammas
On Mon, Jul 23, 2018 at 6:05 PM Giampaolo Rodola' wrote: > This: > > v = a?.b > > ...*implicitly* checks if value is not None [and continues execution]. > This: > > v = a > if a.b is not None: > v = a.b > > ...*explicitly* checks if value is not None and continues execution.
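For reference, the behaviour PEP 505 proposes for `a?.b` can be approximated in current Python. The sketch below is an illustration only: `none_aware_getattr` and `Person` are hypothetical names, not part of the PEP or of any message in this thread. It shows that only a `None` receiver short-circuits, so a missing attribute on a non-None object still raises AttributeError, which is the point made about `spam?.eggs?.bacon` in the thread.

```python
def none_aware_getattr(obj, name):
    """Approximate the proposed `obj?.name` (hypothetical helper).

    A None receiver yields None; any other receiver gets a normal
    attribute lookup, which can still raise AttributeError.
    """
    if obj is None:
        return None
    return getattr(obj, name)


class Person:
    """Hypothetical class used only for the demonstration."""

    def __init__(self, name):
        self.name = name


# None receiver: the lookup is skipped and None is returned.
print(none_aware_getattr(None, "name"))

# Non-None receiver: behaves like ordinary attribute access.
print(none_aware_getattr(Person("Ada"), "name"))

# Non-None receiver without the attribute: the error is not suppressed.
try:
    none_aware_getattr(Person("Ada"), "email")
except AttributeError as exc:
    print("AttributeError:", exc)
```

A chain such as `spam?.eggs?.bacon` would correspond to nesting two calls to the helper, with each step checking only whether its own receiver is None.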

[Distutils] Re: Make an ordered list of sdists to be installed?

2018-07-23 Thread Nicholas Chammas
I don’t know the details, but I did read that Poetry has a sophisticated dependency resolver. https://github.com/sdispater/poetry I don’t know if there is a way to access the resolver independently of the tool, but perhaps it would provide a handy reference. On Mon, Jul 23, 2018 at 5:49 AM, Thomas

Re: Review notification bot

2018-07-22 Thread Nicholas Chammas
e worth mentioning nonetheless. On Sat, Jul 14, 2018 at 11:17 AM Holden Karau wrote: > Hearing no objections (and in a shout out to @ Nicholas Chammas who > initially suggested mention-bot back in 2016) I've set up a copy of mention > bot and run it against my own repo (looks like &

[jira] [Resolved] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-27 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved HADOOP-15559. --- Resolution: Fixed > Clarity on Spark compatibility with hadoop-

[jira] [Commented] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-27 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525697#comment-16525697 ] Nicholas Chammas commented on HADOOP-15559: --- Looks good to me. I will consider raising

[jira] [Comment Edited] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-26 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524469#comment-16524469 ] Nicholas Chammas edited comment on HADOOP-15559 at 6/27/18 2:27 AM

[jira] [Commented] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-26 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524469#comment-16524469 ] Nicholas Chammas commented on HADOOP-15559: --- Hi [~ste...@apache.org] and thank you

[jira] [Created] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

2018-06-25 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created HADOOP-15559: - Summary: Clarity on Spark compatibility with hadoop-aws Key: HADOOP-15559 URL: https://issues.apache.org/jira/browse/HADOOP-15559 Project: Hadoop Common

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-02 Thread Nicholas Chammas
ri, Jun 1, 2018 at 6:51 PM, Nicholas Chammas > wrote: > > pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3 didn’t work for me > > either (even building with -Phadoop-2.7). I guess I’ve been relying on an > > unsupported pattern and will need to figure something else out

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
. > > On Fri, Jun 1, 2018 at 5:50 PM, Nicholas Chammas > wrote: > > Building with -Phadoop-2.7 didn’t help, and if I remember correctly, > > building with -Phadoop-2.8 worked with hadoop-aws in the 2.3.0 release, > so > > it appears something has changed since then. >

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
f you end up > mixing different versions of Hadoop. > > On Fri, Jun 1, 2018 at 4:01 PM, Nicholas Chammas > wrote: > > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 > using > > Flintrock. However, trying to load the hadoop-aws package gave me some &

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Nicholas Chammas
I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 using Flintrock. However, trying to load the hadoop-aws package gave me some errors. $ pyspark --packages org.apache.hadoop:hadoop-aws:2.8.4 :: problems summary :: WARNINGS

[issue22269] Resolve distutils option conflicts with priorities

2018-05-13 Thread Nicholas Chammas
Change by Nicholas Chammas <nicholas.cham...@gmail.com>: -- nosy: +nchammas ___ Python tracker <rep...@bugs.python.org> <https://bugs.python

Re: Documenting the various DataFrame/SQL join types

2018-05-08 Thread Nicholas Chammas
> On Tue, May 8, 2018 at 6:13 AM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> The documentation for DataFrame.join() >> <https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join> >> lists all the join types we sup

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Nicholas Chammas
Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I certainly can, but the problem I’m facing is that of how best to track >> all the DataFrames I no longer want to persist. >> >> I create and persist various DataFrames throu

Re: eager execution and debuggability

2018-05-08 Thread Nicholas Chammas
This may be technically impractical, but it would be fantastic if we could make it easier to debug Spark programs without needing to rely on eager execution. Sprinkling .count() and .checkpoint() at various points in my code is still a debugging technique I use, but it always makes me wish Spark

[jira] [Comment Edited] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468045#comment-16468045 ] Nicholas Chammas edited comment on SPARK-23945 at 5/8/18 10:22 PM

[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468045#comment-16468045 ] Nicholas Chammas commented on SPARK-23945: -- > So in the grand scheme of things I'd exp

[jira] [Comment Edited] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-05-08 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433316#comment-16433316 ] Nicholas Chammas edited comment on SPARK-23945 at 5/8/18 10:13 PM: --- I

Documenting the various DataFrame/SQL join types

2018-05-08 Thread Nicholas Chammas
The documentation for DataFrame.join() lists all the join types we support: - inner - cross - outer - full - full_outer - left - left_outer - right - right_outer - left_semi

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Nicholas Chammas
? On Thu, May 3, 2018 at 10:26 PM Reynold Xin <r...@databricks.com> wrote: > Why do you need the underlying RDDs? Can't you just unpersist the > dataframes that you don't need? > > > On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas < > nicholas.cham...@gmail.

Identifying specific persisted DataFrames via getPersistentRDDs()

2018-04-30 Thread Nicholas Chammas
This seems to be an underexposed part of the API. My use case is this: I want to unpersist all DataFrames except a specific few. I want to do this because I know at a specific point in my pipeline that I have a handful of DataFrames that I need, and everything else is no longer needed. The

Re: Correlated subqueries in the DataFrame API

2018-04-27 Thread Nicholas Chammas
ol from source") >> val df = table.filter($"col".isin(subQ.toSet)) >> >> That also distinguishes between a sub-query and a correlated sub-query >> that uses values from the outer query. We would still need to come up with >> syntax for the correlated case, unless

[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-10 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433316#comment-16433316 ] Nicholas Chammas commented on SPARK-23945: -- I always looked at DataFrames and SQL as two

Correlated subqueries in the DataFrame API

2018-04-09 Thread Nicholas Chammas
I just submitted SPARK-23945 but wanted to double check here to make sure I didn't miss something fundamental. Correlated subqueries are tracked at a high level in SPARK-18455, but it's not

[jira] [Updated] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-09 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-23945: - Description: In SQL you can filter rows based on the result of a subquery: {code:java

[jira] [Created] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-09 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-23945: Summary: Column.isin() should accept a single-column DataFrame as input Key: SPARK-23945 URL: https://issues.apache.org/jira/browse/SPARK-23945 Project

[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-26 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414190#comment-16414190 ] Nicholas Chammas commented on SPARK-22513: -- Thanks for the breakdown. This will be handy

Re: Changing how we compute release hashes

2018-03-23 Thread Nicholas Chammas
To close the loop here: SPARK-23716 <https://issues.apache.org/jira/browse/SPARK-23716> On Fri, Mar 16, 2018 at 5:00 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > OK, will do. > > On Fri, Mar 16, 2018 at 4:41 PM Sean Owen <sro...@gmail.com> wrote: >

[jira] [Comment Edited] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-23 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412423#comment-16412423 ] Nicholas Chammas edited comment on SPARK-23716 at 3/24/18 5:13 AM

[jira] [Resolved] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-23 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-23716. -- Resolution: Won't Fix For my use case, there is no value in updating the Spark release

[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-23 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412218#comment-16412218 ] Nicholas Chammas commented on SPARK-22513: -- Fair enough. Just as an alternate confirmation

[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-19 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405615#comment-16405615 ] Nicholas Chammas commented on SPARK-23534: -- I don't know what it takes to add a Hadoop 3.0 build

[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-19 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405563#comment-16405563 ] Nicholas Chammas commented on SPARK-23534: -- I believe this ticket is a duplicate of SPARK-23151

[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-19 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405561#comment-16405561 ] Nicholas Chammas commented on SPARK-22513: -- [~srowen] - Just curious: How do you know that Spark

[jira] [Created] (SPARK-23716) Change SHA512 style in release artifacts to play nicely with shasum utility

2018-03-16 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-23716: Summary: Change SHA512 style in release artifacts to play nicely with shasum utility Key: SPARK-23716 URL: https://issues.apache.org/jira/browse/SPARK-23716

Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
> On Fri, Mar 16, 2018 at 1:14 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I have sha512sum on my Mac via Homebrew, but yeah as long as the format >> is the same I suppose it doesn’t matter if we use shasum -a or sha512sum. >> >> So shall I fil

Re: Changing how we compute release hashes

2018-03-16 Thread Nicholas Chammas
> *To:* Felix Cheung > *Cc:* rb...@netflix.com; Nicholas Chammas; Spark dev list > > *Subject:* Re: Changing how we compute release hashes > I think the issue with that is that OS X doesn't have "sha512sum". Both it > and Linux have "shasum -a 512" thou

Changing how we compute release hashes

2018-03-15 Thread Nicholas Chammas
To verify that I’ve downloaded a Hadoop release correctly, I can just do this: $ shasum --check hadoop-2.7.5.tar.gz.sha256 hadoop-2.7.5.tar.gz: OK However, since we generate Spark release hashes with GPG
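The check that `shasum --check` performs can be sketched in Python. This is a minimal illustration, assuming the checksum file holds lines in the `<hex digest>  <filename>` layout that `shasum --check` reads; `verify_checksum_line` is a hypothetical helper and the file it checks is a throwaway created for the demonstration, not a real release artifact.

```python
import hashlib
import os
import tempfile


def verify_checksum_line(line, directory=".", algorithm="sha256"):
    """Check one '<hex digest>  <filename>' line against the file on disk.

    Hypothetical helper: returns True when the file's digest matches.
    """
    expected, filename = line.split(maxsplit=1)
    # A leading '*' before the name marks binary mode in checksum files.
    filename = filename.strip().lstrip("*")
    digest = hashlib.new(algorithm)
    with open(os.path.join(directory, filename), "rb") as handle:
        # Read in 1 MiB chunks so large archives are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()


# Demonstration with a temporary file standing in for a downloaded archive.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"not a real release archive"
    with open(os.path.join(tmp, "example.tar.gz"), "wb") as handle:
        handle.write(payload)

    good_line = hashlib.sha256(payload).hexdigest() + "  example.tar.gz"
    bad_line = "0" * 64 + "  example.tar.gz"

    print(verify_checksum_line(good_line, directory=tmp))
    print(verify_checksum_line(bad_line, directory=tmp))
```

A hash published in some other layout would need to be converted to this one-line form before a check like this, or `shasum --check` itself, could consume it.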

Re: Silencing messages from Ivy when calling spark-submit

2018-03-12 Thread Nicholas Chammas
understand some settings. If you happen to figure > out the answer, please report back here. I'm sure others would find it > useful too. > > Bryan > > On Mon, Mar 5, 2018 at 3:50 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Oh, I didn't kn

[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2018-03-07 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389723#comment-16389723 ] Nicholas Chammas commented on SPARK-18492: -- [~imranshaik] - This is an open source project. You

Re: Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
ith "spark.jars.ivySettings" to point to your > ivysettings.xml file. Would that work for you to configure it there? > > Bryan > > On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I couldn’t get an answer any

Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Nicholas Chammas
I couldn’t get an answer anywhere else, so I thought I’d ask here. Is there a way to silence the messages that come from Ivy when you call spark-submit with --packages? (For the record, I asked this question on Stack Overflow.) Would it be a good

[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2018-03-02 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383843#comment-16383843 ] Nicholas Chammas commented on SPARK-18492: -- Are you seeing the same on Spark 2.3.0? Apparently

Re: Please keep s3://spark-related-packages/ alive

2018-03-01 Thread Nicholas Chammas
Marton, Thanks for the tip. (Too bad the docs referenced from the issue I opened with INFRA make no mention of mirrors.cgi.) Matei, A Requester Pays bucket is a good idea. I was trying to avoid

Re: Please keep s3://spark-related-packages/ alive

2018-02-27 Thread Nicholas Chammas
SF projects, like Spark, FWIW. >> > > To clarify, the apache-spark.rb formula in Homebrew uses the Apache > mirror closer.lua script > > > https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-spark.rb#L4 > >michael > > > >> On Mon, Feb 26

Suppressing output from Apache Ivy (?) when calling spark-submit with --packages

2018-02-27 Thread Nicholas Chammas
I’m not sure whether this is something controllable via Spark, but when you call spark-submit with --packages you get a lot of output. Is there any way to suppress it? Does it come from Apache Ivy? I posted more details about what I’m seeing on Stack Overflow

Please keep s3://spark-related-packages/ alive

2018-02-26 Thread Nicholas Chammas
If you go to the Downloads page and download Spark 2.2.1, you’ll get a link to an Apache mirror. It didn’t use to be this way. As recently as Spark 2.2.0, downloads were served via CloudFront, which was backed by an S3

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Nicholas Chammas
Launched a test cluster on EC2 with Flintrock and ran some simple tests. Building Spark took much longer than usual, but that may just be a fluke. Otherwise, all looks good to me. +1 On Fri, Feb 23, 2018 at 10:55 AM Denny Lee wrote:

Re: Kubernetes: why use init containers?

2018-01-09 Thread Nicholas Chammas
I’d like to point out the output of “git show --stat” for that diff: 29 files changed, 130 insertions(+), 1560 deletions(-) +1 for that and generally for the idea of leveraging spark-submit. You can argue that executors downloading from external servers would be faster than downloading from the

[jira] [Commented] (ORC-152) Saving empty Spark DataFrame via ORC does not preserve schema

2017-12-13 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/ORC-152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289678#comment-16289678 ] Nicholas Chammas commented on ORC-152: -- A link to the matching Spark issue is in the description above

[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217115#comment-16217115 ] Nicholas Chammas commented on SPARK-13587: -- To follow-up on my [earlier comment|https

Re: Disabling Closed -> Reopened transition for non-committers

2017-10-05 Thread Nicholas Chammas
Whoops, didn’t mean to send that out to the list. Apologies. Somehow, an earlier draft of my email got sent out. Nick On Thu, Oct 5, 2017 at 9:20 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > The first sign that that conversation was going to go downhill was when > the us

[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-09-15 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168038#comment-16168038 ] Nicholas Chammas commented on SPARK-17025: -- I take that back. I won't be able to test

Re: Run a specific PySpark test or group of tests

2017-08-16 Thread Nicholas Chammas
om> wrote: This generally works for me to just run tests within a class or even a > single test. Not as flexible as pytest -k, which would be nice.. > > $ SPARK_TESTING=1 bin/pyspark pyspark.sql.tests ArrowTests > On Tue, Aug 15, 2017 at 5:49 AM, Nicholas Chammas < > nicholas.cham...

[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-08-15 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128271#comment-16128271 ] Nicholas Chammas commented on SPARK-17025: -- I'm still interested in this but I won't be able

Re: Run a specific PySpark test or group of tests

2017-08-15 Thread Nicholas Chammas
if > I understood correctly. > > > 2017-08-15 3:06 GMT+09:00 Nicholas Chammas <nicholas.cham...@gmail.com>: > >> Say you’re working on something and you want to rerun the PySpark tests, >> focusing on a specific test or group of tests. Is there a way to do that? >> >

Run a specific PySpark test or group of tests

2017-08-14 Thread Nicholas Chammas
Say you’re working on something and you want to rerun the PySpark tests, focusing on a specific test or group of tests. Is there a way to do that? I know that you can test entire modules with this: ./python/run-tests --modules pyspark-sql But I’m looking for something more granular, like

[jira] [Created] (SPARK-21712) Clarify PySpark Column.substr() type checking error message

2017-08-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-21712: Summary: Clarify PySpark Column.substr() type checking error message Key: SPARK-21712 URL: https://issues.apache.org/jira/browse/SPARK-21712 Project: Spark

Re: Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nicholas Chammas
Here’s a repro for a very similar issue where Spark hangs on the UDF, which I think is related to the SPARK_HOME issue. I posted the repro on the EMR forum, but in case you can’t access it: 1. I’m running EMR 5.6.0, Spark 2.1.1, and

[jira] [Commented] (SPARK-21110) Structs should be usable in inequality filters

2017-06-22 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059536#comment-16059536 ] Nicholas Chammas commented on SPARK-21110: -- cc [~marmbrus] - Assuming this is a valid feature

[jira] [Updated] (SPARK-21110) Structs should be usable in inequality filters

2017-06-15 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-21110: - Summary: Structs should be usable in inequality filters (was: Structs should

[jira] [Created] (SPARK-21110) Structs should be orderable

2017-06-15 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-21110: Summary: Structs should be orderable Key: SPARK-21110 URL: https://issues.apache.org/jira/browse/SPARK-21110 Project: Spark Issue Type: Improvement

[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2017-06-02 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035062#comment-16035062 ] Nicholas Chammas commented on SPARK-12661: -- I think we are good to resolve this provided

[jira] [Commented] (SPARK-9862) Join: Handling data skew

2017-05-22 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020030#comment-16020030 ] Nicholas Chammas commented on SPARK-9862: - Is this issue meant to be a SQL-equivalent of SPARK

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-20 Thread Nicholas Chammas
Steve, I think you're a good person to ask about this. Is the below any cause for concern? Or did I perhaps test this incorrectly? Nick On Tue, Apr 18, 2017 at 11:50 PM Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I had trouble starting up a shell with the AWS packa

Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-18 Thread Nicholas Chammas
I had trouble starting up a shell with the AWS package loaded (specifically, org.apache.hadoop:hadoop-aws:2.7.3): [NOT FOUND ] com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms) local-m2-cache: tried

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
rty to > set "spark.scheduler.pool" to something other than the default pool before > a particular Job intended to use that pool is started via that SparkContext. > > On Wed, Apr 5, 2017 at 1:11 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > > Hmm, so

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Hmm, so when I submit an application with `spark-submit`, I need to guarantee it resources using YARN queues and not Spark's scheduler pools. Is that correct? When are Spark's scheduler pools relevant/useful in this context? On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra

[jira] [Comment Edited] (SPARK-19553) Add GroupedData.countApprox()

2017-03-14 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870780#comment-15870780 ] Nicholas Chammas edited comment on SPARK-19553 at 3/14/17 2:38 PM

Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Nicholas Chammas
ty may force developers to more stable kind of API > / platforms & roadmaps. > > > > Thanks Enzo > > On 13 Mar 2017, at 22:09, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > Your question is answered here under "Will GraphFrames be part of

Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Nicholas Chammas
Your question is answered here under "Will GraphFrames be part of Apache Spark?", no? http://graphframes.github.io/#what-are-graphframes Nick On Mon, Mar 13, 2017 at 4:56 PM enzo wrote: > Please see this email trail: no answer so far on the user@spark board.

[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-02 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893703#comment-15893703 ] Nicholas Chammas commented on SPARK-15474: -- cc [~owen.omalley] > ORC data source fa

[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-01 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890930#comment-15890930 ] Nicholas Chammas commented on SPARK-19578: -- Makes sense to me. I suppose the Apache Arrow

[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-01 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890639#comment-15890639 ] Nicholas Chammas commented on SPARK-15474: -- There is a related discussion on ORC-152 which

[jira] [Commented] (SPARK-19578) Poor pyspark performance + incorrect UI input-size metrics

2017-03-01 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890588#comment-15890588 ] Nicholas Chammas commented on SPARK-19578: -- [~holdenk] - Would it make sense to have PySpark's

[jira] [Commented] (SPARK-18381) Wrong date conversion between spark and python for dates before 1583

2017-02-28 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888776#comment-15888776 ] Nicholas Chammas commented on SPARK-18381: -- Oh, and to provide additional information on why

[jira] [Commented] (SPARK-18381) Wrong date conversion between spark and python for dates before 1583

2017-02-28 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888752#comment-15888752 ] Nicholas Chammas commented on SPARK-18381: -- I am seeing a very similar issue when trying to read

Re: New Amazon AMIs for EC2 script

2017-02-23 Thread Nicholas Chammas
spark-ec2 has moved to GitHub and is no longer part of the Spark project. A related issue from the current issue tracker that you may want to follow/comment on is this one: https://github.com/amplab/spark-ec2/issues/74 As I said there, I think requiring custom AMIs is one of the major maintenance

Will .count() always trigger an evaluation of each row?

2017-02-17 Thread Nicholas Chammas
Especially during development, people often use .count() or .persist().count() to force evaluation of all rows — exposing any problems, e.g. due to bad data — and to load data into cache to speed up subsequent operations. But as the optimizer gets smarter, I’m guessing it will eventually learn

[jira] [Commented] (SPARK-19553) Add GroupedData.countApprox()

2017-02-16 Thread Nicholas Chammas (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-19553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870780#comment-15870780 ] Nicholas Chammas commented on SPARK-19553: -- The utility of 1) would be being able to count items

Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-15 Thread Nicholas Chammas
I don't think this is the right place for questions about Databricks. I'm pretty sure they have their own website with a forum for questions about their product. Maybe this? https://forums.databricks.com/ On Wed, Feb 15, 2017 at 2:34 PM Sam Elamin wrote: > Hey folks >

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Nicholas Chammas
Congratulations, Takuya!  On Mon, Feb 13, 2017 at 2:34 PM Felix Cheung wrote: > Congratulations! > > > -- > *From:* Xuefu Zhang > *Sent:* Monday, February 13, 2017 11:29:12 AM > *To:* Xiao Li > *Cc:* Holden Karau; Reynold
