Please add an unsubscribe link to the footer of user list email

2016-06-27 Thread Nicholas Chammas
Howdy, It seems like every week we have at least a couple of people emailing the user list in vain with "Unsubscribe" in the subject, the body, or both. I remember a while back that every email on the user list used to include a footer with a quick link to unsubscribe. It was removed, I believe,

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Nicholas Chammas
For the clueless (like me): https://bahir.apache.org/#home Apache Bahir provides extensions to distributed analytic platforms such as Apache Spark. Initially Apache Bahir will contain streaming connectors that were a part of Apache Spark prior to version 2.0: - streaming-akka -

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Nicholas Chammas
+1 to what Mark said. I've been following this discussion and I don't understand where the sudden "Databricks vs. everybody else" narrative came from. On Mon, Jun 6, 2016 at 11:00 AM Mark Hamstra wrote: > This is not a Databricks vs. The World situation, and the fact

Re: [DISCUSS] Removing or changing maintainer process

2016-06-01 Thread Nicholas Chammas
May 19, 2016 at 11:47 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: I’ve also heard that we should try to keep some other instructions for > contributors to find the “right” reviewers, so it would be great to see > suggestions on that

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Nicholas Chammas
I’ve also heard that we should try to keep some other instructions for contributors to find the “right” reviewers, so it would be great to see suggestions on that. For my part, I’d personally prefer something “automatic”, such as easily tracking who reviewed each patch and having people look at

Re: Question about enabling some of missing rules.

2016-05-15 Thread Nicholas Chammas
t/947b9020b0d621bc97661a0a056297e6889936d3 > > > Thanks! > 2016-05-16 12:05 GMT+09:00 Nicholas Chammas <nicholas.cham...@gmail.com>: > >> Relevant discussion from some time ago: >> https://issues.apache.org/jira/browse/SPARK-3849?focusedCommentId=14168961=com.atla

Re: Question about enabling some of missing rules.

2016-05-15 Thread Nicholas Chammas
Relevant discussion from some time ago: https://issues.apache.org/jira/browse/SPARK-3849?focusedCommentId=14168961&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14168961 In short, if enabling a new style rule requires sweeping changes throughout the code base, then it

Re: Proposal of closing some PRs and maybe some PRs abandoned by its author

2016-05-06 Thread Nicholas Chammas
Alex has built tooling for this btw: https://github.com/databricks/spark-pr-dashboard/pull/71 On Fri, May 6, 2016 at 12:15 PM Ted Yu wrote: > PR #10572 was listed twice. > > In the future, is it possible to include the contributor's handle beside > the PR number so that

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Nicholas Chammas
Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1 A lot of this was discussed a while back when the PR Dashboard was first introduced, and several times before and after that as well. (e.g. August 2014
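The 30-day policy under discussion is easy to sketch; a minimal illustration in plain Python, with made-up PR data rather than anything pulled from the real dashboard:

```python
from datetime import datetime, timedelta

# Sketch of the proposed policy: flag any PR with no activity in the
# last 30 days as a candidate for auto-closing. PR data is illustrative.
prs = [
    {"number": 101, "last_activity": datetime(2016, 1, 2)},
    {"number": 202, "last_activity": datetime(2016, 4, 10)},
]
now = datetime(2016, 4, 18)
cutoff = timedelta(days=30)

stale = [p["number"] for p in prs if now - p["last_activity"] > cutoff]
print(stale)  # [101]
```

The real dashboard tooling linked above works against the GitHub API; this only shows the staleness test itself.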

Re: Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Nicholas Chammas
Yes, this is a known issue. The core devs are already aware of it. [CC dev] FWIW, I believe the Spark 1.6.1 / Hadoop 2.6 package on S3 is not corrupt. It may be the only 1.6.1 package that is not corrupt, though. :/ Nick On Tue, Apr 12, 2016 at 9:00 PM Augustus Hong

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Nicholas Chammas
ploaded them to the spark-related-packages S3 bucket, so hopefully > these packages should be fixed now. > > On Mon, Apr 4, 2016 at 3:37 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Thanks, that was the command. :thumbsup: >> >> On Mon, Apr 4, 2016

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Nicholas Chammas
ished one. > > Unfortunately however, I don't know what tool is used to generate the > > hash and I can't reproduce the format, so I ended up manually > > comparing the hashes. > > > > On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas > > <nicholas.cham...@

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Nicholas Chammas
the root cause is > found. > > On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Just checking in on this again as the builds on S3 are still broken. :/ >> >> Could it have something to do with us moving release-build

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Nicholas Chammas
This is still an issue. The Spark 1.6.1 packages on S3 are corrupt. Is anyone looking into this issue? Is there anything contributors can do to help solve this problem? Nick On Sun, Mar 27, 2016 at 8:49 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > Pingity-ping-p

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-27 Thread Nicholas Chammas
Pingity-ping-pong since this is still a problem. On Thu, Mar 24, 2016 at 4:08 PM Michael Armbrust <mich...@databricks.com> wrote: > Patrick is investigating. > > On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >&g

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-24 Thread Nicholas Chammas
Just checking in on this again as the builds on S3 are still broken. :/ Could it have something to do with us moving release-build.sh <https://github.com/apache/spark/commits/master/dev/create-release/release-build.sh> ? ​ On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas <nich

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-21 Thread Nicholas Chammas
t; confusion, the link I get for a direct download of Spark 1.6.1 / > Hadoop 2.6 is > http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz > > On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas > <nicholas.cham...@gmail.com> wrote: > > I just retried the Spark 1.6.1

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-20 Thread Nicholas Chammas
rable: exiting now > > On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust <mich...@databricks.com> > wrote: > >> Patrick reuploaded the artifacts, so it should be fixed now. >> On Mar 16, 2016 5:48 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> >

Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz Does anyone else have trouble unzipping this? How did this happen? What I get is: $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file gzip:
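The `gzip: unexpected end of file` error above is the classic symptom of a truncated download. A self-contained sketch of the same integrity check using only the Python stdlib (a tiny synthetic archive stands in for the real Spark package):

```python
import gzip
import io
import tarfile

def gzip_ok(path):
    # Decompress the whole stream; a truncated gzip member raises
    # EOFError (or OSError for a corrupt header) before we finish.
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 16):
                pass
        return True
    except (EOFError, OSError):
        return False

# Build a tiny valid .tgz, then truncate a copy to mimic a bad download.
with tarfile.open("good.tgz", "w:gz") as t:
    data = b"hello"
    info = tarfile.TarInfo("file.txt")
    info.size = len(data)
    t.addfile(info, io.BytesIO(data))

with open("good.tgz", "rb") as f:
    blob = f.read()
with open("corrupt.tgz", "wb") as f:
    f.write(blob[: len(blob) // 2])

print(gzip_ok("good.tgz"))     # True
print(gzip_ok("corrupt.tgz"))  # False
```

`gzip -t file.tgz` on the command line performs the equivalent check without decompressing to disk.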

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
euploaded the artifacts, so it should be fixed now. > On Mar 16, 2016 5:48 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> > wrote: > >> Looks like the other packages may also be corrupt. I’m getting the same >> error for the Spark 1.6.1 / Hadoop 2.4 package. &

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
ux, I got: > > $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz > > gzip: stdin: unexpected end of file > tar: Unexpected EOF in archive > tar: Unexpected EOF in archive > tar: Error is not recoverable: exiting now > > On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas < > nichol

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Nicholas Chammas
t; I just experienced the issue, however retrying the download a second > time worked. Could it be that there is some load balancer/cache in > front of the archive and some nodes still serve the corrupt packages? > > On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas > <nicholas.cham..

Re: Mutiple spark contexts

2016-01-27 Thread Nicholas Chammas
There is a lengthy discussion about this on the JIRA: https://issues.apache.org/jira/browse/SPARK-2243 On Wed, Jan 27, 2016 at 1:43 PM Herman van Hövell tot Westerflier < hvanhov...@questtec.nl> wrote: > Just out of curiousity. What is the use case for having multiple active > contexts in a

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1 Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python 2.6 is ancient history and the core Python developers stopped supporting it in 2013. RHEL 5 is not a good enough reason to continue support for Python

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
om > > wrote: > >> I don't see a reason Spark 2.0 would need to support Python 2.6. At this >> point, Python 3 should be the default that is encouraged. >> Most organizations acknowledge the 2.7 is common, but lagging behind the >> version they should theoretically use.

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python installed since they run Python code in PySpark jobs natively. On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, N
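A minimal sketch of the preflight check this implies; the "worker" here is simulated by re-invoking the local interpreter, whereas on a real cluster you would run the same one-liner on each node over ssh (the helper name is mine, not Spark's):

```python
import subprocess
import sys

def python_version(executable):
    # Ask a given interpreter for its major.minor version string.
    out = subprocess.check_output(
        [executable, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"]
    )
    return out.decode().strip()

driver = "%d.%d" % sys.version_info[:2]
worker = python_version(sys.executable)  # stand-in for a remote node

# Mismatched major.minor versions between driver and workers are a
# common source of PySpark job failures.
print(driver == worker)
```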

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
va 7 and python 2.6, no matter how outdated that is. >>> >>> i dont like it either, but i cannot change it. >>> >>> we currently don't use pyspark so i have no stake in this, but if we did >>> i can assure you we would not upgrade to spark 2.x if python 2.6 was >&g

Re: Downloading Hadoop from s3://spark-related-packages/

2015-12-24 Thread Nicholas Chammas
for automated provisioning/deployments.” That would suffice. But as things stand now, I have to guess and wonder at this stuff. Nick ​ On Thu, Dec 24, 2015 at 5:43 AM Steve Loughran <ste...@hortonworks.com> wrote: > > On 24 Dec 2015, at 05:59, Nicholas Chammas <nicholas.cham...@gma

Re: A proposal for Spark 2.0

2015-12-23 Thread Nicholas Chammas
Yeah, I'd also favor maintaining docs with strictly temporary relevance on JIRA when possible. The wiki is like this weird backwater I only rarely visit. Don't we typically do this kind of stuff with an umbrella issue on JIRA? Tom, wouldn't that work well for you? Nick On Wed, Dec 23, 2015 at

Re: Downloading Hadoop from s3://spark-related-packages/

2015-12-23 Thread Nicholas Chammas
replaced the cgi one from before. Also it looks like the lua one >> also supports `action=download` with a filename argument. So you could >> just do something like >> >> wget >> http://www.apache.org/dyn/closer.lua?filename=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Re: Fastest way to build Spark from scratch

2015-12-08 Thread Nicholas Chammas
fresh EC2 instance a significant chunk of the initial build > time might be due to artifact resolution + downloading. Putting > pre-populated Ivy and Maven caches onto your EC2 machine could shave a > decent chunk of time off that first build. > > On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Cham

Re: Fastest way to build Spark from scratch

2015-12-08 Thread Nicholas Chammas
ou know > when some work in a terminal is ready, so you can do the first-thing-in-the > morning build-of-the-SNAPSHOTS > > mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo > > After that you can work on the modules you care about (via the -pl) > option

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-02 Thread Nicholas Chammas
-0 If spark-ec2 is still a supported part of the project, then we should update its version lists as new releases are made. 1.5.2 had the same issue. https://github.com/apache/spark/blob/v1.6.0-rc1/ec2/spark_ec2.py#L54-L91 (I guess as part of the 2.0 discussions we should continue to discuss
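The version list referenced above is hardcoded in spark_ec2.py, which is why it has to be updated for every release. A hypothetical sketch of that kind of check, with an illustrative (not the real) version list:

```python
# Illustrative stand-in for the hardcoded list in spark_ec2.py; the real
# list is much longer and must grow with each release.
VALID_SPARK_VERSIONS = {"1.5.0", "1.5.1", "1.5.2", "1.6.0"}

def validate_spark_version(version):
    # Refuse to launch a cluster for a Spark version the tool doesn't know.
    if version not in VALID_SPARK_VERSIONS:
        raise ValueError("Unknown Spark version: %r" % version)
    return version

print(validate_spark_version("1.6.0"))  # 1.6.0
```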

Fastest way to build Spark from scratch

2015-11-23 Thread Nicholas Chammas
Say I want to build a complete Spark distribution against Hadoop 2.6+ as fast as possible from scratch. This is what I’m doing at the moment: ./make-distribution.sh -T 1C -Phadoop-2.6 -T 1C instructs Maven to spin up 1 thread per available core. This takes around 20 minutes on an m3.large

Re: A proposal for Spark 2.0

2015-11-12 Thread Nicholas Chammas
With regards to Machine learning, it would be great to move useful features from MLlib to ML and deprecate the former. Current structure of two separate machine learning packages seems to be somewhat confusing. With regards to GraphX, it would be great to deprecate the use of RDD in GraphX and

Re: A proposal for Spark 2.0

2015-11-10 Thread Nicholas Chammas
> For this reason, I would *not* propose doing major releases to break substantial API's or perform large re-architecting that prevent users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model.

Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
Thanks for sharing this, Christian. What build of Spark are you using? If I understand correctly, if you are using Spark built against Hadoop 2.6+ then additional configs alone won't help because additional libraries also need to be installed .

Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
ly helps with this as well. > Without the instance-profile, we got it working by copying a > .aws/credentials file up to each node. We could easily automate that > through the templates. > > I don't need any additional libraries. We just need to change the > core-site.xml > > -C

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-05 Thread Nicholas Chammas
-0 The spark-ec2 version is still set to 1.5.1. Nick On Wed, Nov 4, 2015 at 8:20 PM Egor Pahomov wrote: > +1 > > Things, which our infrastructure use and I checked: > > Dynamic allocation > Spark

Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
<https://issues.apache.org/jira/browse/SPARK-7442>. On Fri, Nov 6, 2015 at 12:22 AM Christian <engr...@gmail.com> wrote: > Even with the changes I mentioned above? > On Thu, Nov 5, 2015 at 8:10 PM Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Yep, I t

Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
t; from: spark-1.5.1-bin-hadoop1 > > Are you saying there might be different behavior if I download > spark-1.5.1-hadoop-2.6 and create my cluster? > > On Thu, Nov 5, 2015 at 1:28 PM, Christian <engr...@gmail.com> wrote: > >> Spark 1.5.1-hadoop1 >> >> On

Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Nicholas Chammas
Nick ​ On Sun, Nov 1, 2015 at 5:32 PM Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > On Sun, Nov 1, 2015 at 2:16 PM, Nicholas Chammas > <nicholas.cham...@gmail.com> wrote: > > OK, I’ll focus on the Apache mirrors going forward. > > > > Th

Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Nicholas Chammas
d > just do something like > > wget > http://www.apache.org/dyn/closer.lua?filename=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz&action=download > > Thanks > Shivaram > > On Sun, Nov 1, 2015 at 3:18 PM, Nicholas Chammas > <nicholas.cham...@gmail.com> wrote: > &

Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Nicholas Chammas
; On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran <ste...@hortonworks.com> > wrote: > > > > On 1 Nov 2015, at 03:17, Nicholas Chammas <nicholas.cham...@gmail.com> > > wrote: > > > > https://s3.amazonaws.com/spark-related-packages/ > > > > spa

Downloading Hadoop from s3://spark-related-packages/

2015-10-31 Thread Nicholas Chammas
https://s3.amazonaws.com/spark-related-packages/ spark-ec2 uses this bucket to download and install HDFS on clusters. Is it owned by the Spark project or by the AMPLab? Anyway, it looks like the latest Hadoop install available on there is Hadoop 2.4.0. Are there plans to add newer versions of

Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-16 Thread Nicholas Chammas
t-master.sh -h xxx.xxx.xxx.xxx > > and then use the IP when you start the slaves: > > sbin/start-slave.sh spark://xxx.xxx.xxx.xxx.7077 > > ? > > Regards > JB > > On 10/16/2015 06:01 PM, Nicholas Chammas wrote: > > I'd look into tracing a possible bug here, but I'm no

Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-16 Thread Nicholas Chammas
/28162991/cant-run-spark-1-2-in-standalone-mode-on-mac > http://stackoverflow.com/questions/29412157/passing-hostname-to-netty > > FYI > > On Wed, Oct 14, 2015 at 7:10 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I’m setting the Spark maste

Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-16 Thread Nicholas Chammas
Nick ​ On Fri, Oct 16, 2015 at 12:05 PM Sean Owen <so...@cloudera.com> wrote: > It's used in scripts like sbin/start-master.sh > > On Fri, Oct 16, 2015 at 5:01 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> I'd look into tracing a possible

Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-11 Thread Nicholas Chammas
You can find the source tagged for release on GitHub, as was clearly linked to in the thread to vote on the release (titled "[VOTE] Release Apache Spark 1.5.1 (RC1)"). Is there something about that thread that was unclear? Nick On Sun, Oct

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
now until something changes. If it changes, then those projects > might need to build Spark on their own and host older hadoop versions, etc. > > On Wed, Oct 7, 2015 at 9:59 AM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Thanks guys. >> >> Regarding

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
, Oct 5, 2015 at 2:41 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: > >> Thanks for looking into this Josh. >> >> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen <joshro...@databricks.com> >> wrote: >> >>> I'm working on a fix for th

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
on't upload new artifacts with different SHAs for the > builds which *did* succeed). > > I expect to have this finished in the next day or so; I'm currently > blocked by some infra downtime but expect that to be resolved soon. > > - Josh > > On Mon, Oct 5, 2015 at 8:46

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
reaks spark-ec2 script. > > On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> hadoop1 package for Scala 2.10 wasn't in RC1 either: >> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/ >> >> On Sun, Oct 4, 2015 at 5:1

Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-04 Thread Nicholas Chammas
I’m looking here: https://s3.amazonaws.com/spark-related-packages/ I believe this is where one set of official packages is published. Please correct me if this is not the case. It appears that almost every version of Spark up to and including 1.5.0 has included a --bin-hadoop1.tgz release (e.g.

Re: How to get the HDFS path for each RDD

2015-09-27 Thread Nicholas Chammas
Shouldn't this discussion be held on the user list and not the dev list? The dev list (this list) is for discussing development on Spark itself. Please move the discussion accordingly. Nick On Sun, Sep 27, 2015 at 10:57 PM, Fengdong Yu wrote: > Hi Anchit, > can you create

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-28 Thread Nicholas Chammas
/forms/erct2s6KRR As noted before, your results are anonymous and public. Thanks again for participating! I hope this has been useful to the community. Nick On Tue, Aug 25, 2015 at 1:31 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: Final chance to fill out the survey! http://goo.gl

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-25 Thread Nicholas Chammas
Final chance to fill out the survey! http://goo.gl/forms/erct2s6KRR I'm gonna close it to new responses tonight and send out a summary of the results. Nick On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm planning to close the survey to further responses

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-20 Thread Nicholas Chammas
, Aug 17, 2015 at 11:09 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: Howdy folks! I’m interested in hearing about what people think of spark-ec2 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the formal JIRA process. Your answers will all be anonymous and public

[survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Nicholas Chammas
Howdy folks! I’m interested in hearing about what people think of spark-ec2 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the formal JIRA process. Your answers will all be anonymous and public. If the embedded form below doesn’t work for you, you can use this link to get the

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Nicholas Chammas
See: https://issues.apache.org/jira/browse/SPARK-3533 Feel free to comment there and make a case if you think the issue should be reopened. Nick On Fri, Aug 14, 2015 at 11:11 AM Abhishek R. Singh abhis...@tetrationanalytics.com wrote: A workaround would be to have multiple passes on the RDD

Re: Unsubscribe

2015-08-03 Thread Nicholas Chammas
The way to do that is to follow the Unsubscribe link here for dev@spark: http://spark.apache.org/community.html We can't drop you. You have to do it yourself. Nick On Mon, Aug 3, 2015 at 1:54 PM Trevor Grant trevor.d.gr...@gmail.com wrote: Please drop me from this list Trevor Grant Data

Re: Should spark-ec2 get its own repo?

2015-08-02 Thread Nicholas Chammas
On Sat, Aug 1, 2015 at 1:09 PM Matt Goodman meawo...@gmail.com wrote: I am considering porting some of this to a more general spark-cloud launcher, including google/aliyun/rackspace. It shouldn't be hard at all given the current approach for setup/install. FWIW, there are already some tools

Re: Should spark-ec2 get its own repo?

2015-07-13 Thread Nicholas Chammas
At a high level I see the spark-ec2 scripts as an effort to provide a reference implementation for launching EC2 clusters with Apache Spark On a side note, this is precisely how I used spark-ec2 for a personal project that does something similar: reference implementation. Nick On Mon, Jul 13, 2015

Should spark-ec2 get its own repo?

2015-07-03 Thread Nicholas Chammas
spark-ec2 is kind of a mini project within a project. It’s composed of a set of EC2 AMIs https://github.com/mesos/spark-ec2/tree/branch-1.4/ami-list under someone’s account (maybe Patrick’s?) plus the following 2 code bases: - Main command line tool:

Re: Stats on targets for 1.5.0

2015-06-19 Thread Nicholas Chammas
I think it would be fantastic if this work was burned down before adding big new chunks of work. The stat is worth keeping an eye on. +1, keeping in mind that burning down work also means just targeting it for a different release or closing it. :) Nick On Fri, Jun 19, 2015 at 3:18 PM Sean

Re: Sidebar: issues targeted for 1.4.0

2015-06-18 Thread Nicholas Chammas
Given fixed time, adding more TODOs generally means other stuff has to be taken out for the release. If not, then it happens de facto anyway, which is worse than managing it on purpose. +1 to this. I wouldn't mind helping go through open issues on JIRA targeted for the next release around RC

Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?

2015-06-12 Thread Nicholas Chammas
I'm personally in favor, but I don't have a sense of how many people still rely on Hadoop 1. Nick On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran ste...@hortonworks.com wrote: +1 for 2.2+ Not only are the APIs in Hadoop 2 better, there's more people testing Hadoop 2.x spark, and bugs in Hadoop

Re: [PySpark DataFrame] When a Row is not a Row

2015-05-13 Thread Nicholas Chammas
the columns. Basically, the rows are just named tuples (called `Row`). -- Davies Liu Sent with Sparrow http://www.sparrowmailapp.com/?sig On Tuesday, May 12, 2015 at 4:49 AM, Nicholas Chammas wrote: This is really strange. # Spark 1.3.1 print type(results

Re: @since version tag for all dataframe/sql methods

2015-05-13 Thread Nicholas Chammas
Are we not doing the same thing for the Python API? On Wed, May 13, 2015 at 10:43 AM Olivier Girardot ssab...@gmail.com wrote: that's a great idea! On Wed, May 13, 2015 at 7:38 AM, Reynold Xin r...@databricks.com wrote: I added @since version tag for all public dataframe/sql

Re: How to link code pull request with JIRA ID?

2015-05-13 Thread Nicholas Chammas
...@gmail.com wrote: following up from Nicholas, it is [SPARK-12345] Your PR description where 12345 is the jira number. One thing I tend to forget is when/where to include the subproject tag e.g. [MLLIB] 2015-05-13 11:11 GMT-07:00 Nicholas Chammas nicholas.cham

Re: Adding/Using More Resolution Types on JIRA

2015-05-12 Thread Nicholas Chammas
I tend to find that any large project has a lot of walking dead JIRAs, and pretending they are simply Open causes problems. Any state is better for these, so I favor this. Agreed. 1. Inactive: A way to clear out inactive/dead JIRA’s without indicating a decision has been made one way or

[PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Nicholas Chammas
This is really strange. # Spark 1.3.1 print type(results) <class 'pyspark.sql.dataframe.DataFrame'> a = results.take(1)[0] print type(a) <class 'pyspark.sql.types.Row'> print pyspark.sql.types.Row <class 'pyspark.sql.types.Row'> print type(a) == pyspark.sql.types.Row False print
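PySpark isn't needed to reproduce the underlying gotcha; since PySpark Rows are essentially named tuples, two distinct classes that share the name `Row` show the same behavior in plain Python:

```python
from collections import namedtuple

# Two structurally identical namedtuple classes are still distinct types,
# so a type comparison against the "wrong" Row class fails even though
# both repr() as "Row".
RowA = namedtuple("Row", ["name"])
RowB = namedtuple("Row", ["name"])

a = RowA(name="Nick")
print(type(a).__name__)      # Row
print(type(a) == RowB)       # False: same name, different class
print(isinstance(a, tuple))  # True: a Row is just a tuple underneath
```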

PySpark DataFrame: Preserving nesting when selecting a nested field

2015-05-09 Thread Nicholas Chammas
Take a look: df = sqlContext.jsonRDD(sc.parallelize(['{"settings": {"os": "OS X", "version": "10.10"}}'])) df.printSchema() root |-- settings: struct (nullable = true) ||-- os: string (nullable = true) ||-- version: string (nullable = true) # Now I want to drop the version column by #

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-09 Thread Nicholas Chammas
I've opened an issue for a few doc fixes that the PySpark DataFrame API needs: SPARK-7505 https://issues.apache.org/jira/browse/SPARK-7505 On Fri, May 8, 2015 at 3:10 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, neat. So in the example I gave earlier, I’d do this to get columns

Re: Having pyspark.sql.types.StructType implement __iter__()

2015-05-09 Thread Nicholas Chammas
I've filed SPARK-7507 https://issues.apache.org/jira/browse/SPARK-7507 for this. On Fri, May 8, 2015 at 5:57 PM Reynold Xin r...@databricks.com wrote: Sure. On Fri, May 8, 2015 at 2:43 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: StructType looks an awful lot like a Python

Re: pyspark.sql.types.StructType.fromJson() is a lie

2015-05-09 Thread Nicholas Chammas
I've reported this in SPARK-7506 https://issues.apache.org/jira/browse/SPARK-7506. On Thu, May 7, 2015 at 6:58 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: Renaming fields to get around SPARK-2775 https://issues.apache.org/jira/browse/SPARK-2775. I’m doing this clunky thing: 1

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Nicholas Chammas
And a link to SPARK-7035 https://issues.apache.org/jira/browse/SPARK-7035 (which Xiangrui mentioned in his initial email) for the lazy. On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng men...@gmail.com wrote: On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
...@gmail.com wrote: To add to the above discussion, Pandas, allows suffixing and prefixing to solve this issue http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html Rakesh On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: DataFrames

DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table aliases. This is essential when joining dataframes that have identically named columns. # PySpark 1.3.1 df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}'])) df2 = sqlContext.jsonRDD(sc.parallelize(['{"a":
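For comparison, here is the SQL affordance the post asks DataFrames to match, demonstrated with stdlib sqlite3 (data made up to mirror the example):

```python
import sqlite3

# Table aliases let SQL disambiguate identically named columns after a
# join, which is the capability the post wants on DataFrames.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (a INTEGER, other TEXT)")
con.execute("CREATE TABLE t2 (a INTEGER, other TEXT)")
con.execute("INSERT INTO t1 VALUES (4, 'I know')")
con.execute("INSERT INTO t2 VALUES (4, 'I dunno')")

row = con.execute(
    "SELECT x.other, y.other FROM t1 AS x JOIN t2 AS y ON x.a = y.a"
).fetchone()
print(row)  # ('I know', 'I dunno')
```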

Re: branch-1.4 nightly builds?

2015-05-08 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-1517 That issue should probably be unassigned since I am not actively working on it. (I can't unassign myself.) Nick On Fri, May 8, 2015 at 5:38 PM Punyashloka Biswal punya.bis...@gmail.com wrote: Dear Spark devs, Does anyone maintain nightly

Re: pyspark.sql.types.StructType.fromJson() is a lie

2015-05-07 Thread Nicholas Chammas
a bug than feature. On Thu, May 7, 2015 at 1:55 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Observe, my fellow Sparkophiles (Spark 1.3.1): json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}'])) json_rdd.schema StructType(List(StructField(name,StringType,true))) type

pyspark.sql.types.StructType.fromJson() is a lie

2015-05-07 Thread Nicholas Chammas
Observe, my fellow Sparkophiles (Spark 1.3.1): json_rdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Nick"}'])) json_rdd.schema StructType(List(StructField(name,StringType,true))) type(json_rdd.schema) <class 'pyspark.sql.types.StructType'> json_rdd.schema.json()

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Nicholas Chammas
/remotecontent?filepath=org/apache/hadoop/hadoop-aws/2.6.0/hadoop-aws-2.6.0.jar And add: export CLASSPATH=$CLASSPATH:hadoop-aws-2.6.0.jar And try to relaunch. Thanks, Peter Rudenko On 2015-05-07 19:30, Nicholas Chammas wrote: Hmm, I just tried changing s3n to s3a: py4j.protocol.Py4JJavaError

Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Nicholas Chammas
Details are here: https://issues.apache.org/jira/browse/SPARK-7442 It looks like something specific to building against Hadoop 2.6? Nick

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Nicholas Chammas
(Hortonworks, Cloudera, MapR) do you use? Thanks, Peter Rudenko On 2015-05-07 19:25, Nicholas Chammas wrote: Details are here: https://issues.apache.org/jira/browse/SPARK-7442 It looks like something specific to building against Hadoop 2.6? Nick

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Nicholas Chammas
/3271168 So for now need to manually add that jar to classpath on hadoop-2.6. Thanks, Peter Rudenko On 2015-05-07 19:41, Nicholas Chammas wrote: I can try that, but the issue is I understand this is supposed to work out of the box (like it does with all the other Spark/Hadoop pre-built

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nicholas Chammas
I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns

Re: [discuss] ending support for Java 6?

2015-04-30 Thread Nicholas Chammas
Nicholas Chammas nicholas.cham...@gmail.com wrote: I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Nicholas Chammas
You can check JIRA for any existing plans. If there isn't any, then feel free to create a JIRA and make the case there for why this would be a good feature to add. Nick On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi, Is there any plan to add the

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Nicholas Chammas
it, and it would be therefore useful for me to translate Pandas code to Spark... Isn't the goal of Spark Dataframe to allow all the features of Pandas/R Dataframe using Spark ? Regards, Olivier. Le mer. 29 avr. 2015 à 21:09, Nicholas Chammas nicholas.cham...@gmail.com a écrit : You can check
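The thread above asks whether pandas' `DataFrame.shift` has a counterpart in Spark DataFrames. In Spark SQL the usual substitute is a window function (`pyspark.sql.functions.lag` / `lead` over an ordered `Window`), since a distributed DataFrame has no implicit row order. The plain-Python sketch below only illustrates the shift semantics being discussed — it is not the Spark API, and the function name is ours:

```python
def shift(values, n=1, fill=None):
    """Displace a list by n positions, filling vacated slots with `fill`.

    Mirrors the semantics of pandas' DataFrame.shift on a single column:
    positive n shifts values toward the end, negative n toward the start,
    and the result always has the same length as the input.
    """
    size = len(values)
    if n >= 0:
        n = min(n, size)  # shifting past the end yields all-fill
        return [fill] * n + values[:size - n]
    n = min(-n, size)
    return values[n:] + [fill] * n


print(shift([1, 2, 3]))        # [None, 1, 2]
print(shift([1, 2, 3], n=-1))  # [2, 3, None]
```

In actual Spark code the equivalent would look roughly like `F.lag("col", 1).over(Window.orderBy("ts"))`, with the ordering column chosen explicitly — there is no positional default as in pandas.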

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Nicholas Chammas
I like the idea of having design docs be kept up to date and tracked in git. If the Apache repo isn't a good fit, perhaps we can have a separate repo just for design docs? Maybe something like github.com/spark-docs/spark-docs/ ? If there's other stuff we want to track but haven't, perhaps we can

Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread Nicholas Chammas
And unfortunately, many Jenkins executor slots are being taken by stale Spark PRs... On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote: anyways, the build queue is SLAMMED... we're going to need at least a day to catch up w/this. i'll be keeping an eye on system loads and

Re: Spark build time

2015-04-22 Thread Nicholas Chammas
I suggest searching the archives for this list as there were several previous discussions about this problem. JIRA also has several issues related to this. Some pointers: - SPARK-3431 https://issues.apache.org/jira/browse/SPARK-3431: Parallelize Scala/Java test execution -

Re: Should we let everyone set Assignee?

2015-04-22 Thread Nicholas Chammas
To repeat what Patrick said (literally): If an issue is “assigned” to person X, but some other person Y submits a great patch for it, I think we have some obligation to Spark users and to the community to merge the better patch. So the idea of reserving the right to add a feature, it just seems

Is spark-ec2 for production use?

2015-04-21 Thread Nicholas Chammas
Is spark-ec2 intended for spinning up production Spark clusters? I think the answer is no. However, the docs for spark-ec2 https://spark.apache.org/docs/latest/ec2-scripts.html very much leave that possibility open, and indeed I see many people asking questions or opening issues that stem from

Re: Is spark-ec2 for production use?

2015-04-21 Thread Nicholas Chammas
process so if you choose to deploy via bigtop in test/prod/etc you know things have gone through a certain amount of rigor beforehand Nate -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Tuesday, April 21, 2015 12:46 PM To: Nicholas Chammas Cc: Spark dev

Gitter chat room for Spark

2015-04-16 Thread Nicholas Chammas
Would we be interested in having a public chat room? Gitter http://gitter.im offers them for free for open source projects. It's like web-based IRC. Check out the Docker room for example: https://gitter.im/docker/docker And if people prefer to use actual IRC, Gitter offers a bridge for that

Re: wait time between start master and start slaves

2015-04-14 Thread Nicholas Chammas
\ --write-out %{http_code} localhost:8080 )done spark/sbin/start-slaves.sh Turns out that the master typically takes 3-4 seconds to come up. That’s 15 seconds saved. Hurray for yak shaving! Nick ​ On Sun, Apr 12, 2015 at 5:56 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: Oh, good point
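The snippet above describes polling the master's web UI with `curl --write-out %{http_code}` until it returns HTTP 200, and only then running `start-slaves.sh`. A minimal Python sketch of that readiness probe follows; the function name, URL, and timeout values are illustrative assumptions, not taken from the thread:

```python
import time
import urllib.error
import urllib.request


def wait_for_master(url, timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200, or raise TimeoutError.

    Returns the number of seconds waited before the first 200 response.
    """
    start = time.monotonic()
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            pass  # master not up yet (connection refused etc.); keep polling
        if time.monotonic() - start > timeout:
            raise TimeoutError(f"{url} not ready after {timeout}s")
        time.sleep(interval)
```

A launcher script would call something like `wait_for_master("http://master-node:8080")` between `start-master.sh` and `start-slaves.sh`, instead of sleeping for a fixed interval.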

Re: wait time between start master and start slaves

2015-04-12 Thread Nicholas Chammas
, SparkUI.DEFAULT_PORT) } Better retrieve effective UI port before probing. Cheers On Sat, Apr 11, 2015 at 2:38 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: So basically, to tell if the master is ready to accept slaves, just poll http://master-node:4040 for an HTTP 200 response? ​ On Sat, Apr

Re: wait time between start master and start slaves

2015-04-11 Thread Nicholas Chammas
to check if the master is up though. I guess we could poll the Master Web UI and see if we get a 200/ok response Shivaram On Fri, Apr 10, 2015 at 8:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Check this out https://github.com/mesos/spark-ec2/blob