[jira] [Commented] (SPARK-3849) Automate remaining Spark Code Style Guide rules

2015-03-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14380603#comment-14380603
 ] 

Nicholas Chammas commented on SPARK-3849:
-

Sounds good.

My quick summary (which does not replace the due diligence just discussed) is 
that we need a way to enable new style rules (Scala at first, but maybe 
Python/R/Java too) on the whole repo.

However, we don't want a new rule coming online to require fixing all 
outstanding problems at once. Rather, we want the rule to check the whole repo 
but fail the patch (via Jenkins) only if code touched in a given patch (i.e. 
code in the git diff) violates some style rule.

This will be impossible in cases where rule failures aren't tied to specific 
lines. But when they are (e.g. line too long), we want to line them up against 
the git diff line numbers. If there's overlap, fail the style check for that 
patch and point out the failing rule and line numbers.

This way the repo can incrementally come into compliance with new style rules, 
rather than having to fix everything at once with a single, large, and painful 
patch.
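
To make the overlap check concrete, here is a rough sketch (hypothetical 
helper names; this is not an existing Spark script) of lining up style-checker 
output against the lines a patch actually touches:

{code}
# Hypothetical sketch: fail a patch only when a style violation lands on a
# line that the patch itself touched.
import re
import subprocess

def changed_lines(base_ref="origin/master"):
    """Map each changed file to the set of new-side line numbers in the diff."""
    diff = subprocess.check_output(
        ["git", "diff", "--unified=0", base_ref, "--", "*.scala"]).decode()
    touched, current_file = {}, None
    for line in diff.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        elif line.startswith("@@") and current_file:
            # Hunk header looks like: @@ -12,3 +15,4 @@
            match = re.search(r"\+(\d+)(?:,(\d+))?", line)
            start, count = int(match.group(1)), int(match.group(2) or 1)
            touched.setdefault(current_file, set()).update(range(start, start + count))
    return touched

def overlapping_violations(violations, touched):
    """violations: iterable of (file, line, rule) tuples from the style checker."""
    return [v for v in violations if v[1] in touched.get(v[0], set())]
{code}

If {{overlapping_violations()}} returns anything, Jenkins would fail the style 
check for that patch and report the offending rules and line numbers.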

 Automate remaining Spark Code Style Guide rules
 ---

 Key: SPARK-3849
 URL: https://issues.apache.org/jira/browse/SPARK-3849
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Nicholas Chammas

 Style problems continue to take up a large amount of review time, mostly 
 because there are many [Spark Code Style 
 Guide|https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide]
  rules that have not been automated.
 This issue tracks the remaining rules that have not been automated.
 To minimize the impact of introducing new rules that would otherwise require 
 sweeping changes across the code base, we should look to *have new rules 
 apply only to new code where possible*. See [this dev list 
 discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/Scalastyle-improvements-large-code-reformatting-td8755.html]
  for more background on this topic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14380464#comment-14380464
 ] 

Nicholas Chammas commented on SPARK-6481:
-

The Spark user can initiate state transitions, but the issue needs to be 
assigned to it in order to do so.

So here's what I'm gonna do, after chatting briefly with Patrick:
* Save the assigned user, if any
* Assign to the Spark user
* Mark as in progress ONLY IF the issue is Open
** I dunno if we want to change the issue state if it doesn't start out as 
Open. Lemme know if you disagree.
* Restore the original assignee, including Unassigned if that's what it was.

Sound good to everybody? I'm going to implement this in the 
[jira_api.py|https://github.com/databricks/spark-pr-dashboard/blob/master/sparkprs/jira_api.py]
 that Josh pointed me to.
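
For illustration, the sequence above could look roughly like this using the 
{{jira}} Python client (the transition and bot user names are assumptions on 
my part; {{jira_api.py}} may drive the REST API differently):

{code}
# Hedged sketch of the save-assignee / assign-bot / transition / restore flow.
from jira import JIRA

def mark_in_progress(server, user, password, issue_key, bot_user="apachespark"):
    client = JIRA(server=server, basic_auth=(user, password))
    issue = client.issue(issue_key)

    original_assignee = issue.fields.assignee  # may be None (Unassigned)

    if issue.fields.status.name == "Open":
        # The issue must be assigned before it can be transitioned.
        client.assign_issue(issue, bot_user)
        transition = next(t for t in client.transitions(issue)
                          if t["name"] == "Start Progress")  # assumed workflow name
        client.transition_issue(issue, transition["id"])

    # Restore the original assignee, including Unassigned. Whether passing None
    # truly unassigns depends on the client/REST call, so treat this as a sketch.
    client.assign_issue(issue, original_assignee.name if original_assignee else None)
{code}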

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14380585#comment-14380585
 ] 

Nicholas Chammas commented on SPARK-6481:
-

PR for this: https://github.com/databricks/spark-pr-dashboard/pull/49

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378838#comment-14378838
 ] 

Nicholas Chammas commented on SPARK-6481:
-

Since there is no guaranteed way to map GitHub usernames to JIRA usernames, 
what should we do about the JIRA assignee?

A JIRA issue needs an assignee in order to be marked In Progress. We can have 
the script:
# always assign the issue to the Apache Spark user
# keep it assigned to whoever has it assigned, if any (this may be different 
from the PR user)
# in the case of no current assignee, assign to Apache Spark just to mark the 
JIRA in progress, then remove assignee

Any preferences [~marmbrus] / [~pwendell]?

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378114#comment-14378114
 ] 

Nicholas Chammas commented on SPARK-6481:
-

[~pwendell] - Where is the GitHub JIRA sync script triggered from? I want to 
see how it's invoked, as well as get some way to run the script on demand for 
testing.

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378393#comment-14378393
 ] 

Nicholas Chammas edited comment on SPARK-6481 at 3/24/15 7:07 PM:
--

Ah, thanks for the pointers.

So should that script be removed from the Spark repo?

Also, how would I go about testing changes to {{jira_api.py}} without getting 
credentials?


was (Author: nchammas):
So should that script be removed from the Spark repo?

Also, how would I go about testing changes to {{jira_api.py}} without getting 
credentials?

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378393#comment-14378393
 ] 

Nicholas Chammas commented on SPARK-6481:
-

So should that script be removed from the Spark repo?

Also, how would I go about testing changes to {{jira_api.py}} without getting 
credentials?

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378436#comment-14378436
 ] 

Nicholas Chammas commented on SPARK-6481:
-

The change Michael/Patrick want is for state transitions, and AFAICT I don't 
have permission to do that with my personal JIRA account.

If my personal account is given the appropriate permissions (need to trigger 
state transitions; need to view project workflow), then certainly I can test 
things out using my personal credentials.

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375929#comment-14375929
 ] 

Nicholas Chammas commented on SPARK-2394:
-

Thank you for posting this information for others!

 Make it easier to read LZO-compressed files from EC2 clusters
 -

 Key: SPARK-2394
 URL: https://issues.apache.org/jira/browse/SPARK-2394
 Project: Spark
  Issue Type: Improvement
  Components: EC2, Input/Output
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor
  Labels: compression

 Amazon hosts [a large Google n-grams data set on 
 S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is 
 perfect, among other things, for putting together interesting and easily 
 reproducible public demos of Spark's capabilities.
 The problem is that the data set is compressed using LZO, and it is currently 
 more painful than it should be to get your average {{spark-ec2}} cluster to 
 read input compressed in this way.
 This is what one has to go through to get a Spark cluster created with 
 {{spark-ec2}} to read LZO-compressed files:
 # Install the latest LZO release, perhaps via {{yum}}.
 # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build 
 it. To build {{hadoop-lzo}} you need Maven. 
 # Install Maven. For some reason, [you cannot install Maven with 
 {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
  so install it manually.
 # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
 configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
 # Make [the appropriate 
 calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
  to {{sc.newAPIHadoopFile}}.
 This seems like a bit too much work for what we're trying to accomplish.
 If we expect this to be a common pattern -- reading LZO-compressed files from 
 a {{spark-ec2}} cluster -- it would be great if we could somehow make this 
 less painful.
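
For reference, the {{sc.newAPIHadoopFile}} call in step 5 boils down to 
something like the following (a PySpark sketch; the S3 path is illustrative, 
the input format class comes from the hadoop-lzo project, and {{sc}} is an 
existing SparkContext):

{code}
# Sketch only: read LZO-compressed text via hadoop-lzo's input format.
lzo_rdd = sc.newAPIHadoopFile(
    "s3n://my-bucket/path/to/lzo-compressed-data",      # illustrative path
    "com.hadoop.mapreduce.LzoTextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
{code}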



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6474:

Issue Type: Improvement  (was: Bug)

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376584#comment-14376584
 ] 

Nicholas Chammas commented on SPARK-6474:
-

This change also fits the pattern of 
[{{request_spot_instances()}}|https://github.com/apache/spark/blob/474d1320c9b93c501710ad1cfa836b8284562a2c/ec2/spark_ec2.py#L542],
 which is called on the connection like {{run_instances()}} as opposed to on an 
{{Image}}.
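
For reference, a rough before/after sketch of the substitution (boto 2.x; the 
AMI ID and option values below are illustrative, not what spark_ec2.py 
actually passes):

{code}
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# Before: launch via the Image object.
image = conn.get_image("ami-35b1885c")
reservation = image.run(min_count=1, max_count=1,
                        key_name="my-pair",
                        instance_type="m3.large")

# After: launch via the connection, which exposes newer options
# such as ebs_optimized.
reservation = conn.run_instances("ami-35b1885c",
                                 min_count=1, max_count=1,
                                 key_name="my-pair",
                                 instance_type="m3.large",
                                 ebs_optimized=True)
{code}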

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6474:

Priority: Minor  (was: Major)

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376572#comment-14376572
 ] 

Nicholas Chammas edited comment on SPARK-6474 at 3/23/15 8:29 PM:
--

LGTM. Just setting the Priority to Minor since this doesn't cause any major 
problems, though it should be fixed.


was (Author: nchammas):
LGTM.

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376572#comment-14376572
 ] 

Nicholas Chammas commented on SPARK-6474:
-

LGTM.

 Replace image.run with connection.run_instances in spark_ec2.py
 ---

 Key: SPARK-6474
 URL: https://issues.apache.org/jira/browse/SPARK-6474
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Andrew Drozdov

 After looking at an issue in Boto [1], ec2.image.Image.run and 
 ec2.connection.EC2Connection.run_instances are similar calls, but 
 run_instances appears to have more features and is more up to date. For 
 example, run_instances has the capability to launch ebs_optimized instances 
 while run does not. The run call is being used in only a couple places in 
 spark_ec2.py, so let's replace it with run_instances.
 [1] https://github.com/boto/boto/issues/3054



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377034#comment-14377034
 ] 

Nicholas Chammas commented on SPARK-6481:
-

I'm guessing this will be done via 
[github_jira_sync.py|https://github.com/apache/spark/blob/master/dev/github_jira_sync.py].
 OK, will take a look this week.

 Set In Progress when a PR is opened for an issue
 --

 Key: SPARK-6481
 URL: https://issues.apache.org/jira/browse/SPARK-6481
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

 [~pwendell] and I are not sure if this is possible, but it would be really 
 helpful if the JIRA status was updated to In Progress when we do the 
 linking to an open pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[issue21423] concurrent.futures.ThreadPoolExecutor/ProcessPoolExecutor should accept an initializer argument

2015-03-20 Thread Nicholas Chammas

Changes by Nicholas Chammas nicholas.cham...@gmail.com:


--
nosy: +Nicholas Chammas

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue21423
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Nabble is a third-party site that tries its best to archive mail sent out
over the list. Nothing guarantees it will be in sync with the real mailing
list.

To get the truth on what was sent over this, Apache-managed list, you
unfortunately need to go to the Apache archives:
http://mail-archives.apache.org/mod_mbox/spark-user/

Nick

On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote:

 There might be some delay:


 http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view


 On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
 wrote:

 Thanks, Ted. Well, so far even there I'm only seeing my post and not, for
 example, your response.

 On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Was this one of the threads you participated ?
 http://search-hadoop.com/m/JW1q5w0p8x1

 You should be able to find your posts on search-hadoop.com

 On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Sorry if this is a total noob question but is there a reason why I'm only
 seeing folks' responses to my posts in emails but not in the browser view
 under apache-spark-user-list.1001560.n3.nabble.com?  Is this a matter of
 setting your preferences such that your responses only go to email and
 never
 to the browser-based view of the list? I don't seem to see such a
 preference...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Sure, you can use Nabble or search-hadoop or whatever you prefer.

My point is just that the source of truth is the Apache archives, and
these other sites may or may not be in sync with that truth.

On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote:

 I prefer using search-hadoop.com which provides better search capability.

 Cheers

 On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Nabble is a third-party site that tries its best to archive mail sent out
 over the list. Nothing guarantees it will be in sync with the real mailing
 list.

 To get the truth on what was sent over this, Apache-managed list, you
 unfortunately need to go the Apache archives:
 http://mail-archives.apache.org/mod_mbox/spark-user/

 Nick

 On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote:

 There might be some delay:


 http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view


 On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
 wrote:

 Thanks, Ted. Well, so far even there I'm only seeing my post and not,
 for example, your response.

 On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Was this one of the threads you participated ?
 http://search-hadoop.com/m/JW1q5w0p8x1

 You should be able to find your posts on search-hadoop.com

 On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Sorry if this is a total noob question but is there a reason why I'm
 only
 seeing folks' responses to my posts in emails but not in the browser
 view
 under apache-spark-user-list.1001560.n3.nabble.com?  Is this a matter
 of
 setting your preferences such that your responses only go to email and
 never
 to the browser-based view of the list? I don't seem to see such a
 preference...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Yes, that is mostly why these third-party sites have sprung up around the
official archives--to provide better search. Did you try the link Ted
posted?

On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:

 It seems that those archives are not necessarily easy to find stuff in. Is
 there a search engine on top of them? so as to find e.g. your own posts
 easily?

 On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Sure, you can use Nabble or search-hadoop or whatever you prefer.

 My point is just that the source of truth are the Apache archives, and
 these other sites may or may not be in sync with that truth.

 On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote:

 I prefer using search-hadoop.com which provides better search
 capability.

 Cheers

 On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Nabble is a third-party site that tries its best to archive mail sent
 out over the list. Nothing guarantees it will be in sync with the real
 mailing list.

 To get the truth on what was sent over this, Apache-managed list, you
 unfortunately need to go the Apache archives:
 http://mail-archives.apache.org/mod_mbox/spark-user/

 Nick

 On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote:

 There might be some delay:


 http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view


 On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Ted. Well, so far even there I'm only seeing my post and not,
 for example, your response.

 On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Was this one of the threads you participated ?
 http://search-hadoop.com/m/JW1q5w0p8x1

 You should be able to find your posts on search-hadoop.com

 On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg 
 dgoldenberg...@gmail.com wrote:

 Sorry if this is a total noob question but is there a reason why I'm
 only
 seeing folks' responses to my posts in emails but not in the browser
 view
 under apache-spark-user-list.1001560.n3.nabble.com?  Is this a
 matter of
 setting your preferences such that your responses only go to email
 and never
 to the browser-based view of the list? I don't seem to see such a
 preference...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








Re: Processing of text file in large gzip archive

2015-03-16 Thread Nicholas Chammas
You probably want to update this line as follows:

lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)

For more details on why, see this answer
http://stackoverflow.com/a/27631722/877069.

Nick
​

On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote:

 1. I don't think textFile is capable of unpacking a .gz file. You need to
 use hadoopFile or newAPIHadoop file for this.


 Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do
 is compute splits on gz files, so if you have a single file, you'll have a
 single partition.

 Processing 30 GB of gzipped data should not take that long, at least with
 the Scala API. Python not sure, especially under 1.2.1.




[jira] [Updated] (SPARK-6342) Leverage cfncluster in spark_ec2

2015-03-15 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6342:

Component/s: EC2

 Leverage cfncluster in spark_ec2 
 -

 Key: SPARK-6342
 URL: https://issues.apache.org/jira/browse/SPARK-6342
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Alex Rothberg
Priority: Minor

 Consider taking advantage of cfncluster 
 (http://cfncluster.readthedocs.org/en/latest/) in the spark_ec2 script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360534#comment-14360534
 ] 

Nicholas Chammas commented on SPARK-6282:
-

[~joshrosen], [~davies]: Does this error look familiar to you?

 Strange Python import error when using random() in a lambda function
 

 Key: SPARK-6282
 URL: https://issues.apache.org/jira/browse/SPARK-6282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor

 Consider the exemplary Python code below:
from random import random
from pyspark.context import SparkContext
from xval_mllib import read_csv_file_as_list

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
    # data = sc.parallelize([1, 2, 3, 4, 5], 2)
    d = data.map(lambda x: (random(), x))
    print d.first()
 Data is read from a large CSV file. Running this code results in a Python 
 import error:
 ImportError: No module named _winreg
 If I use 'import random' and 'random.random()' in the lambda function no 
 error occurs. Also no error occurs, for both kinds of import statements, for 
 a small artificial data set like the one shown in a commented line.  
 The full error trace, the source code of csv reading code (function 
 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
 dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359404#comment-14359404
 ] 

Nicholas Chammas commented on SPARK-6282:
-

Shouldn't be related to boto. _winreg appears to be something Python uses to 
access the Windows registry, which is strange.

Please give us more details about your cluster setup, where you are running the 
driver from, etc. Also, what if you try using numpy's implementation of 
{{random}}?
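
For example (illustrative only):

{code}
import numpy as np
d = data.map(lambda x: (np.random.random(), x))
{code}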

 Strange Python import error when using random() in a lambda function
 

 Key: SPARK-6282
 URL: https://issues.apache.org/jira/browse/SPARK-6282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor

 Consider the exemplary Python code below:
from random import random
from pyspark.context import SparkContext
from xval_mllib import read_csv_file_as_list

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
    # data = sc.parallelize([1, 2, 3, 4, 5], 2)
    d = data.map(lambda x: (random(), x))
    print d.first()
 Data is read from a large CSV file. Running this code results in a Python 
 import error:
 ImportError: No module named _winreg
 If I use 'import random' and 'random.random()' in the lambda function no 
 error occurs. Also no error occurs, for both kinds of import statements, for 
 a small artificial data set like the one shown in a commented line.  
 The full error trace, the source code of csv reading code (function 
 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
 dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master

2015-03-12 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5189:

Description: 
As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
 I did some testing in {{us-east-1}}. This is, concretely, what the problem 
looks like:
 || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
 Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but 
I think the point is clear enough.
# It's more complicated to add slaves to an existing cluster (a la 
[SPARK-2008]), since slaves are only configured through the master during the 
setup of the master itself.

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The 
goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like 
[Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
internals and perhaps even allow us to build [one tool that launches Spark 
clusters on several different cloud 
platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].

More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it 
fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
configures it as a master or slave, and joins it to a cluster.
* Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
that script.

  was:
As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
# It's more complicated to add slaves to an existing cluster (a la 
[SPARK-2008]), since slaves are only configured through the master during the 
setup of the master itself.

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The 
goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like 
[Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
internals and perhaps even allow us to build [one tool that launches Spark 
clusters on several different cloud 
platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].

More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it 
fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
configures it as a master or slave, and joins it to a cluster.
* Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
that script.

[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master

2015-03-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359665#comment-14359665
 ] 

Nicholas Chammas commented on SPARK-5189:
-

For the record, this is the script I used to get the launch time stats above:

{code}
{
python -m timeit -r 6 -n 1 \
--setup 'import subprocess; import time; subprocess.call("yes y | 
./ec2/spark-ec2 destroy launch-test --identity-file /path/to/file.pem 
--key-pair my-pair --region us-east-1", shell=True); time.sleep(60)' \
'subprocess.call("./ec2/spark-ec2 launch launch-test --slaves 99 
--identity-file /path/to/file.pem --key-pair my-pair --region us-east-1 --zone 
us-east-1c --instance-type m3.large", shell=True)'

yes y | ./ec2/spark-ec2 destroy launch-test --identity-file 
/path/to/file.pem --key-pair my-pair --region us-east-1
}
{code}

 Reorganize EC2 scripts so that nodes can be provisioned independent of Spark 
 master
 ---

 Key: SPARK-5189
 URL: https://issues.apache.org/jira/browse/SPARK-5189
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas

 As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
 then setting up all the slaves together. This includes broadcasting files 
 from the lonely master to potentially hundreds of slaves.
 There are 2 main problems with this approach:
 # Broadcasting files from the master to all slaves using 
 [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
 (e.g. during [ephemeral-hdfs 
 init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
  or during [Spark 
 setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
  takes a long time. This time increases as the number of slaves increases.
  I did some testing in {{us-east-1}}. This is, concretely, what the problem 
 looks like:
  || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
 | 1 | 8m 44s |
 | 10 | 13m 45s |
 | 25 | 22m 50s |
 | 50 | 37m 30s |
 | 75 | 51m 30s |
 | 99 | 1h 5m 30s |
  Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, 
 but I think the point is clear enough.
 # It's more complicated to add slaves to an existing cluster (a la 
 [SPARK-2008]), since slaves are only configured through the master during the 
 setup of the master itself.
 Logically, the operations we want to implement are:
 * Provision a Spark node
 * Join a node to a cluster (including an empty cluster) as either a master or 
 a slave
 * Remove a node from a cluster
 We need our scripts to roughly be organized to match the above operations. 
 The goals would be:
 # When launching a cluster, enable all cluster nodes to be provisioned in 
 parallel, removing the master-to-slave file broadcast bottleneck.
 # Facilitate cluster modifications like adding or removing nodes.
 # Enable exploration of infrastructure tools like 
 [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
 internals and perhaps even allow us to build [one tool that launches Spark 
 clusters on several different cloud 
 platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
 More concretely, the modifications we need to make are:
 * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
 equivalent, slave-side operations.
 * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure 
 it fully creates a node that can be used as either a master or slave.
 * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
 configures it as a master or slave, and joins it to a cluster.
 * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
 that script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times

2015-03-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354956#comment-14354956
 ] 

Nicholas Chammas commented on SPARK-4325:
-

At this point it's more an umbrella task containing any issues that impact 
spark-ec2 cluster launch times. Dunno if that's appropriate, but I've seen 
other issues structured like this.

I'm fine with closing this issue, but it's what I'm using to group issues 
related to the same problem.

 Improve spark-ec2 cluster launch times
 --

 Key: SPARK-4325
 URL: https://issues.apache.org/jira/browse/SPARK-4325
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 There are several optimizations we know we can make to [{{setup.sh}} | 
 https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
 faster.
 There are also some improvements to the AMIs that will help a lot.
 Potential improvements:
 * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This 
 will reduce or eliminate SSH wait time and Ganglia init time.
 * Replace instances of {{download; rsync to rest of cluster}} with parallel 
 downloads on all nodes of the cluster.
 * Replace instances of 
  {code}
 for node in $NODES; do
   command
   sleep 0.3
 done
 wait{code}
  with simpler calls to {{pssh}}.
 * Remove the [linear backoff | 
 https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
  when we wait for SSH availability now that we are already waiting for EC2 
 status checks to clear before testing SSH.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times

2015-03-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354939#comment-14354939
 ] 

Nicholas Chammas commented on SPARK-4325:
-

[~srowen] - I should perhaps change the linked issues to "contains" links, 
since SPARK-5189 and SPARK-3821 are where the actual launch time improvements 
are. The subtasks here (one of which was just resolved as a dup of SPARK-3821) 
are relatively insignificant.

 Improve spark-ec2 cluster launch times
 --

 Key: SPARK-4325
 URL: https://issues.apache.org/jira/browse/SPARK-4325
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 There are several optimizations we know we can make to [{{setup.sh}} | 
 https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
 faster.
 There are also some improvements to the AMIs that will help a lot.
 Potential improvements:
 * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This 
 will reduce or eliminate SSH wait time and Ganglia init time.
 * Replace instances of {{download; rsync to rest of cluster}} with parallel 
 downloads on all nodes of the cluster.
 * Replace instances of 
  {code}
 for node in $NODES; do
   command
   sleep 0.3
 done
 wait{code}
  with simpler calls to {{pssh}}.
 * Remove the [linear backoff | 
 https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
  when we wait for SSH availability now that we are already waiting for EC2 
 status checks to clear before testing SSH.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6246) spark-ec2 can't handle clusters with > 100 nodes

2015-03-10 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-6246:
---

 Summary: spark-ec2 can't handle clusters with > 100 nodes
 Key: SPARK-6246
 URL: https://issues.apache.org/jira/browse/SPARK-6246
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.3.0
Reporter: Nicholas Chammas
Priority: Minor


This appears to be a new restriction, perhaps resulting from our upgrade of 
boto. Maybe it's a new restriction from EC2. Not sure yet.

We didn't have this issue around the Spark 1.1.0 time frame from what I can 
remember. I'll track down where the issue is and when it started.

Attempting to launch a cluster with 100 slaves yields the following:

{code}
Spark AMI: ami-35b1885c
Launching instances...
Launched 100 slaves in us-east-1c, regid = r-9c408776
Launched master in us-east-1c, regid = r-92408778
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
Traceback (most recent call last):
  File "./ec2/spark_ec2.py", line 1338, in <module>
    main()
  File "./ec2/spark_ec2.py", line 1330, in main
    real_main()
  File "./ec2/spark_ec2.py", line 1170, in real_main
    cluster_state='ssh-ready'
  File "./ec2/spark_ec2.py", line 795, in wait_for_cluster_state
    statuses = conn.get_all_instance_status(instance_ids=[i.id for i in cluster_instances])
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 737, in get_all_instance_status
    InstanceStatusSet, verb='POST')
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1204, in get_object
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
{code}

This problem seems to be with {{get_all_instance_status()}}, though I am not 
sure if other methods are affected too.
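
One possible workaround (a sketch only, not necessarily how this will 
ultimately be fixed) is to batch the status calls so no single request carries 
more than 100 instance IDs:

{code}
def get_statuses_in_batches(conn, instance_ids, batch_size=100):
    """Query instance statuses in chunks that stay under EC2's 100-ID limit."""
    statuses = []
    for i in range(0, len(instance_ids), batch_size):
        batch = instance_ids[i:i + batch_size]
        statuses.extend(conn.get_all_instance_status(instance_ids=batch))
    return statuses
{code}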



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6246) spark-ec2 can't handle clusters with > 100 nodes

2015-03-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354969#comment-14354969
 ] 

Nicholas Chammas commented on SPARK-6246:
-

FYI [~shivaram].

 spark-ec2 can't handle clusters with > 100 nodes
 

 Key: SPARK-6246
 URL: https://issues.apache.org/jira/browse/SPARK-6246
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.3.0
Reporter: Nicholas Chammas
Priority: Minor

 This appears to be a new restriction, perhaps resulting from our upgrade of 
 boto. Maybe it's a new restriction from EC2. Not sure yet.
 We didn't have this issue around the Spark 1.1.0 time frame from what I can 
 remember. I'll track down where the issue is and when it started.
 Attempting to launch a cluster with 100 slaves yields the following:
 {code}
 Spark AMI: ami-35b1885c
 Launching instances...
 Launched 100 slaves in us-east-1c, regid = r-9c408776
 Launched master in us-east-1c, regid = r-92408778
 Waiting for AWS to propagate instance metadata...
 Waiting for cluster to enter 'ssh-ready' state.ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
 Traceback (most recent call last):
   File "./ec2/spark_ec2.py", line 1338, in <module>
     main()
   File "./ec2/spark_ec2.py", line 1330, in main
     real_main()
   File "./ec2/spark_ec2.py", line 1170, in real_main
     cluster_state='ssh-ready'
   File "./ec2/spark_ec2.py", line 795, in wait_for_cluster_state
     statuses = conn.get_all_instance_status(instance_ids=[i.id for i in cluster_instances])
   File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 737, in get_all_instance_status
     InstanceStatusSet, verb='POST')
   File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1204, in get_object
     raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
 {code}
 This problem seems to be with {{get_all_instance_status()}}, though I am not 
 sure if other methods are affected too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4325) Improve spark-ec2 cluster launch times

2015-03-10 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-4325:
-

Reopening after updating contains issue links.

 Improve spark-ec2 cluster launch times
 --

 Key: SPARK-4325
 URL: https://issues.apache.org/jira/browse/SPARK-4325
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 There are several optimizations we know we can make to [{{setup.sh}} | 
 https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
 faster.
 There are also some improvements to the AMIs that will help a lot.
 Potential improvements:
 * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This 
 will reduce or eliminate SSH wait time and Ganglia init time.
 * Replace instances of {{download; rsync to rest of cluster}} with parallel 
 downloads on all nodes of the cluster.
 * Replace instances of 
  {code}
 for node in $NODES; do
   command
   sleep 0.3
 done
 wait{code}
  with simpler calls to {{pssh}}.
 * Remove the [linear backoff | 
 https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
  when we wait for SSH availability now that we are already waiting for EC2 
 status checks to clear before testing SSH.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354991#comment-14354991
 ] 

Nicholas Chammas commented on SPARK-6220:
-

Another thought to add, there are options for running instances on dedicated 
hardware and securing provisioned IOPs that we are likely (well, I am likely) 
to use. Those could also grow into top-level options, making our option list 
really long.

If we go with the original suggestion here and provide some generic way to pass 
those options through, perhaps it makes sense to invest in SPARK-925 at the 
same time so that users in most cases would just specify those options in a 
file and not have to fidget with very long command line parameters.

A command-line equivalent for passing options through will still be needed of 
course, but it won't be as big of a deal if people have to type some kind of 
quasi-JSON or YAML since they have the config file as well.
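
For illustration, the generic pass-through could be as simple as something 
like this (hypothetical helper, not part of spark_ec2.py today):

{code}
import ast

def parse_passthrough_options(raw_options):
    """Turn options like ['ebs_optimized=True', 'placement_group=my-group']
    into a kwargs dict that can be handed straight to the boto call."""
    kwargs = {}
    for raw in raw_options:
        key, _, value = raw.partition("=")
        try:
            kwargs[key] = ast.literal_eval(value)  # True, 3, etc. become typed values
        except (ValueError, SyntaxError):
            kwargs[key] = value  # otherwise keep the raw string
    return kwargs

# e.g. conn.run_instances(ami_id, **parse_passthrough_options(ec2_instance_options))
{code}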

 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick as long as it can be made to pass the options 
 correctly to boto in most cases.
 I followed the example of {{ssh}}, which supports multiple extended options 
 similarly.
 {code}
 ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-03-10 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5312:

Description: 
We currently use an [unwieldy grep/sed 
contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
 to detect new public classes in PRs.

-Apparently, sbt lets you get a list of public classes [much more 
directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
{{show compile:discoveredMainClasses}}. We should use that instead.-

There is a tool called [ClassUtil|http://software.clapper.org/classutil/] that 
seems to help give this kind of information much more directly. We should look 
into using that.

  was:
We currently use an [unwieldy grep/sed 
contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
 to detect new public classes in PRs.

Apparently, sbt lets you get a list of public classes [much more 
directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
{{show compile:discoveredMainClasses}}. We should use that instead.


 Use sbt to detect new or changed public classes in PRs
 --

 Key: SPARK-5312
 URL: https://issues.apache.org/jira/browse/SPARK-5312
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 We currently use an [unwieldy grep/sed 
 contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
  to detect new public classes in PRs.
 -Apparently, sbt lets you get a list of public classes [much more 
 directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
 {{show compile:discoveredMainClasses}}. We should use that instead.-
 There is a tool called [ClassUtil|http://software.clapper.org/classutil/] 
 that seems to help give this kind of information much more directly. We 
 should look into using that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-03-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355622#comment-14355622
 ] 

Nicholas Chammas commented on SPARK-5312:
-

Thanks for looking into this [~boyork]. I'm looking forward to seeing what comes 
of it.

The goal, as you hinted at, is basically to give reviewers a complement to the 
MIMA check that lets them see public API changes for each PR very easily.

 Use sbt to detect new or changed public classes in PRs
 --

 Key: SPARK-5312
 URL: https://issues.apache.org/jira/browse/SPARK-5312
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 We currently use an [unwieldy grep/sed 
 contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
  to detect new public classes in PRs.
 Apparently, sbt lets you get a list of public classes [much more 
 directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
 {{show compile:discoveredMainClasses}}. We should use that instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6246) spark-ec2 can't handle clusters with > 100 nodes

2015-03-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355642#comment-14355642
 ] 

Nicholas Chammas commented on SPARK-6246:
-

I dunno, I haven't looked into the problem yet (been out all day), but I'm 
surprised that everything else works with > 100 nodes: creating nodes, 
destroying them, getting them. It's just the status check call.

If we have to, sure I'll batch the calls. But I suspect there's a better way to 
do things. I'm surprised boto doesn't just abstract this problem away.

Anyway, I'll look into it and report back.
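
If batching does turn out to be the way to go, I'm picturing something roughly 
like this (untested sketch; {{conn}} and {{cluster_instances}} are the existing 
variables in {{spark_ec2.py}}):

{code}
def get_instance_statuses(conn, instances, batch_size=100):
    """Work around EC2's 100-instance-ID limit by batching the status calls."""
    statuses = []
    instance_ids = [i.id for i in instances]
    for start in range(0, len(instance_ids), batch_size):
        batch = instance_ids[start:start + batch_size]
        statuses.extend(conn.get_all_instance_status(instance_ids=batch))
    return statuses
{code}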

 spark-ec2 can't handle clusters with > 100 nodes
 

 Key: SPARK-6246
 URL: https://issues.apache.org/jira/browse/SPARK-6246
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.3.0
Reporter: Nicholas Chammas
Priority: Minor

 This appears to be a new restriction, perhaps resulting from our upgrade of 
 boto. Maybe it's a new restriction from EC2. Not sure yet.
 We didn't have this issue around the Spark 1.1.0 time frame from what I can 
 remember. I'll track down where the issue is and when it started.
 Attempting to launch a cluster with 100 slaves yields the following:
 {code}
 Spark AMI: ami-35b1885c
 Launching instances...
 Launched 100 slaves in us-east-1c, regid = r-9c408776
 Launched master in us-east-1c, regid = r-92408778
 Waiting for AWS to propagate instance metadata...
 Waiting for cluster to enter 'ssh-ready' state.ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the 
 maximum number of instance IDs that can be specificied (100). Please specify 
 fewer than 100 instance 
 IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
 Traceback (most recent call last):
   File "./ec2/spark_ec2.py", line 1338, in <module>
     main()
   File "./ec2/spark_ec2.py", line 1330, in main
     real_main()
   File "./ec2/spark_ec2.py", line 1170, in real_main
     cluster_state='ssh-ready'
   File "./ec2/spark_ec2.py", line 795, in wait_for_cluster_state
     statuses = conn.get_all_instance_status(instance_ids=[i.id for i in 
 cluster_instances])
   File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 
 737, in get_all_instance_status
     InstanceStatusSet, verb='POST')
   File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 
 1204, in get_object
     raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the 
 maximum number of instance IDs that can be specificied (100). Please specify 
 fewer than 100 instance 
 IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
 {code}
 This problem seems to be with {{get_all_instance_status()}}, though I am not 
 sure if other methods are affected too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5313) Create simple framework for highlighting changes introduced in a PR

2015-03-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355819#comment-14355819
 ] 

Nicholas Chammas commented on SPARK-5313:
-

I had an idea to generalize the process of comparing any given property across 
{{master}} and a given PR and displaying the result on the PR. I'll update the 
issue links from "contains" to "relates to", because that's all it is--an 
abstracted way for our Jenkins script to report on PR characteristics.
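
Roughly, the helper I have in mind would look something like this (just a 
sketch; none of these names exist anywhere yet):

{code}
def diff_property(check, master_checkout, pr_checkout):
    """Run the same check against master and the PR and report what changed.

    `check` is any function that takes a checkout path and returns a set of
    strings (public class names, dependencies, config keys, ...).
    """
    master_result = check(master_checkout)
    pr_result = check(pr_checkout)
    added = sorted(pr_result - master_result)
    removed = sorted(master_result - pr_result)
    return added, removed
{code}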

 Create simple framework for highlighting changes introduced in a PR
 ---

 Key: SPARK-5313
 URL: https://issues.apache.org/jira/browse/SPARK-5313
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 For any given PR, we may want to run a bunch of checks along the following 
 lines: 
 * Show property X of {{master}}
 * Show the same property X of this PR
 * Call out any differences on the GitHub page
 It might be helpful to write a simple function that takes any check -- itself 
 represented as a function -- as input, runs the check on master and the PR, 
 and outputs the diff.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4325) Improve spark-ec2 cluster launch times

2015-03-10 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-4325:

Description: 
This is an umbrella task to capture several pieces of work related to 
significantly improving spark-ec2 cluster launch times.

There are several optimizations we know we can make to [{{setup.sh}} | 
https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
faster.

There are also some improvements to the AMIs that will help a lot.

Potential improvements:
* Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will 
reduce or eliminate SSH wait time and Ganglia init time.
* Replace instances of {{download; rsync to rest of cluster}} with parallel 
downloads on all nodes of the cluster.
* Replace instances of 
 {code}
for node in $NODES; do
  command
  sleep 0.3
done
wait{code}
 with simpler calls to {{pssh}}.
* Remove the [linear backoff | 
https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
 when we wait for SSH availability now that we are already waiting for EC2 
status checks to clear before testing SSH.

  was:
There are several optimizations we know we can make to [{{setup.sh}} | 
https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
faster.

There are also some improvements to the AMIs that will help a lot.

Potential improvements:
* Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will 
reduce or eliminate SSH wait time and Ganglia init time.
* Replace instances of {{download; rsync to rest of cluster}} with parallel 
downloads on all nodes of the cluster.
* Replace instances of 
 {code}
for node in $NODES; do
  command
  sleep 0.3
done
wait{code}
 with simpler calls to {{pssh}}.
* Remove the [linear backoff | 
https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
 when we wait for SSH availability now that we are already waiting for EC2 
status checks to clear before testing SSH.


 Improve spark-ec2 cluster launch times
 --

 Key: SPARK-4325
 URL: https://issues.apache.org/jira/browse/SPARK-4325
 Project: Spark
  Issue Type: Umbrella
  Components: EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 This is an umbrella task to capture several pieces of work related to 
 significantly improving spark-ec2 cluster launch times.
 There are several optimizations we know we can make to [{{setup.sh}} | 
 https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches 
 faster.
 There are also some improvements to the AMIs that will help a lot.
 Potential improvements:
 * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This 
 will reduce or eliminate SSH wait time and Ganglia init time.
 * Replace instances of {{download; rsync to rest of cluster}} with parallel 
 downloads on all nodes of the cluster.
 * Replace instances of 
  {code}
 for node in $NODES; do
   command
   sleep 0.3
 done
 wait{code}
  with simpler calls to {{pssh}}.
 * Remove the [linear backoff | 
 https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665]
  when we wait for SSH availability now that we are already waiting for EC2 
 status checks to clear before testing SSH.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6219) Expand Python lint checks to check for compilation errors

2015-03-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353325#comment-14353325
 ] 

Nicholas Chammas commented on SPARK-6219:
-

That's a good point; I haven't checked to see what's already covered in
that way by unit tests.

At the very least, I can say that this will catch stuff in spark-ec2 and
examples that unit tests currently do not cover.

Also, it runs very, very quickly.
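
For reference, the check I have in mind is essentially just byte-compiling 
everything, along these lines (the directory list is illustrative):

{code}
import compileall
import sys

# Byte-compile every .py file under these directories without importing them.
# compile_dir() returns a falsy value if any file in that tree fails to compile.
results = [compileall.compile_dir(path, quiet=True)
           for path in ["python", "ec2", "examples", "dev"]]
sys.exit(0 if all(results) else 1)
{code}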



 Expand Python lint checks to check for compilation errors
 --

 Key: SPARK-6219
 URL: https://issues.apache.org/jira/browse/SPARK-6219
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas
Priority: Minor

 An easy lint check for Python would be to make sure the stuff at least 
 compiles. That will catch only the most egregious errors, but it should help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354217#comment-14354217
 ] 

Nicholas Chammas commented on SPARK-6220:
-

I took another look at the 2 boto methods we'd be passing these options to.
* 
[{{boto.ec2.image.Image.run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
* 
[{{boto.ec2.connection.EC2Connection.request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

The parameter types they take are quite varied, from {{bool}} to {{string}} to 
{{list(string)}} to 
{{list(boto.ec2.networkinterface.NetworkInterfaceSpecification)}}. Covering 
them generically, even just a subset of them, would require us to take input 
that can be type cast somehow--maybe some kind of stripped-down JSON.

I'm not sure we want to do that to spark-ec2.
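
For what it's worth, the simplest version of "type cast somehow" that I can 
picture is splitting each option on the first {{=}} and letting 
{{ast.literal_eval}} guess the type (sketch only; {{opts.ec2_instance_options}} 
is a made-up name):

{code}
import ast

def parse_extended_option(option):
    """Turn a string like 'ebs_optimized=True' into a (key, typed_value) pair."""
    key, _, raw_value = option.partition("=")
    try:
        # Handles True/False, ints, floats, and quoted strings/lists.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value  # fall back to a plain string
    return key, value

# extra_kwargs = dict(parse_extended_option(o) for o in opts.ec2_instance_options)
# image.run(..., **extra_kwargs)
{code}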

Maybe instead I should just add the options I need to support 
{{instance_profile_arn}} / {{instance_profile_name}} (for IAM support) and 
{{instance_initiated_shutdown_behavior}} (for self-terminating clusters) and 
call it a day.

[~shivaram], [~joshrosen], [~pwendell]: What do y'all think?

 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick as long as it can be made to pass the options 
 correctly to boto in most cases.
 I followed the example of {{ssh}}, which supports multiple extended options 
 similarly.
 {code}
 ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6206) spark-ec2 script reporting SSL error?

2015-03-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352481#comment-14352481
 ] 

Nicholas Chammas commented on SPARK-6206:
-

OK, let us know what you find, [~Joe6521].

In general, please try to validate your issue on the user list or on Stack 
Overflow before reporting it here, unless you are really sure you've found a 
problem with Spark (as opposed to your environment).

 spark-ec2 script reporting SSL error?
 -

 Key: SPARK-6206
 URL: https://issues.apache.org/jira/browse/SPARK-6206
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Joe O

 I have been using the spark-ec2 script for several months with no problems.
 Recently, when executing a script to launch a cluster I got the following 
 error:
 {code}
 [Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate 
 routines:X509_load_cert_crl_file:system lib
 {code}
 Nothing launches, the script exits.
 I am not sure if something on my machine changed, this is a problem with EC2's 
 certs, or a problem with Python. 
 It occurs 100% of the time, and has been occurring over at least the last two 
 days. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-08 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6220:

Description: 
There are many EC2 options exposed by the boto library that spark-ec2 uses. 

Over time, many of these EC2 options have been bubbled up here and there to 
become spark-ec2 options.

Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become 
spark-ec2 options.

While major options are well suited to this type of promotion, we should 
probably allow users to pass through EC2 options they want to use through 
spark-ec2 in some generic way.

Let's add two options:
* {{--ec2-instance-option}} - 
[{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
* {{--ec2-spot-instance-option}} - 
[{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to 
the underlying boto call.

For example:
{code}
spark-ec2 --ec2-instance-option 
{code}

  was:
There are many EC2 options exposed by the boto library that spark-ec2 uses. 

Over time, many of these EC2 options have been bubbled up here and there to 
become spark-ec2 options.

Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become 
spark-ec2 options.

While major options are well suited to this type of promotion, we should 
probably allow users to pass through EC2 options they want to use through 
spark-ec2 in some generic way.

Let's add two options:
* {{--ec2-instance-option}} - 
[{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
* {{--ec2-spot-instance-option}} - 
[{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to 
the underlying boto call.


 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 --ec2-instance-option 
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352489#comment-14352489
 ] 

Nicholas Chammas commented on SPARK-6220:
-

cc [~joshrosen] and [~shivaram] for feedback.

The immediate motivation for this is the work I'm doing on automating 
spark-perf runs.

As part of an automated spark-perf run, I'd like to:
* set {{instance_initiated_shutdown_behavior=terminate}} for the non-spot 
instances launched by spark-ec2 (i.e. the master), so that the cluster can 
self-terminate without needing outside input
* set {{instance_profile_arn}} for the master so that spark-perf results can be 
uploaded to S3 without having to handle AWS user credentials, via use of IAM 
profiles

Since my use case is specialized, I didn't think it was worth adding top-level 
options for these EC2 features. So I generalized the idea to support any EC2 
option supported by boto.
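
Concretely, whether via a generic pass-through or via new top-level options, my 
use case just needs two keyword arguments to reach the existing {{image.run}} 
call, roughly like this (sketch; {{opts.instance_profile_name}} is a 
hypothetical option):

{code}
# `image` is the boto Image object spark_ec2.py already looks up for the AMI.
# Both keyword arguments below are existing parameters of boto's Image.run().
master_res = image.run(
    instance_type=opts.instance_type,
    instance_initiated_shutdown_behavior="terminate",  # master can self-terminate
    instance_profile_name=opts.instance_profile_name,  # IAM profile for S3 uploads
)
{code}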

 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 --ec2-instance-option 
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-08 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6220:

Description: 
There are many EC2 options exposed by the boto library that spark-ec2 uses. 

Over time, many of these EC2 options have been bubbled up here and there to 
become spark-ec2 options.

Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become 
spark-ec2 options.

While major options are well suited to this type of promotion, we should 
probably allow users to pass through EC2 options they want to use through 
spark-ec2 in some generic way.

Let's add two options:
* {{--ec2-instance-option}} - 
[{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
* {{--ec2-spot-instance-option}} - 
[{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to 
the underlying boto call.

For example:
{code}
spark-ec2 \
...
--ec2-instance-option instance_initiated_shutdown_behavior=terminate \
--ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like 
this will do the trick.


I followed the example of {{ssh}}, which supports multiple extended options 
similarly.

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}

  was:
There are many EC2 options exposed by the boto library that spark-ec2 uses. 

Over time, many of these EC2 options have been bubbled up here and there to 
become spark-ec2 options.

Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become 
spark-ec2 options.

While major options are well suited to this type of promotion, we should 
probably allow users to pass through EC2 options they want to use through 
spark-ec2 in some generic way.

Let's add two options:
* {{--ec2-instance-option}} - 
[{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
* {{--ec2-spot-instance-option}} - 
[{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to 
the underlying boto call.

For example:
{code}
spark-ec2 --ec2-instance-option 
{code}


 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick.
 I followed the example of {{ssh}}, which supports multiple extended options 
 similarly.
 {code}
 ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-08 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6220:

Description: 
There are many EC2 options exposed by the boto library that spark-ec2 uses. 

Over time, many of these EC2 options have been bubbled up here and there to 
become spark-ec2 options.

Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become 
spark-ec2 options.

While major options are well suited to this type of promotion, we should 
probably allow users to pass through EC2 options they want to use through 
spark-ec2 in some generic way.

Let's add two options:
* {{--ec2-instance-option}} - 
[{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
* {{--ec2-spot-instance-option}} - 
[{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to 
the underlying boto call.

For example:
{code}
spark-ec2 \
...
--ec2-instance-option instance_initiated_shutdown_behavior=terminate \
--ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like 
this will do the trick as long as it can be made to pass the options correctly 
to boto in most cases.


I followed the example of {{ssh}}, which supports multiple extended options 
similarly.

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}

  was:
There are many EC2 options exposed by the boto library that spark-ec2 uses. 

Over time, many of these EC2 options have been bubbled up here and there to 
become spark-ec2 options.

Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become 
spark-ec2 options.

While major options are well suited to this type of promotion, we should 
probably allow users to pass through EC2 options they want to use through 
spark-ec2 in some generic way.

Let's add two options:
* {{--ec2-instance-option}} - 
[{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
* {{--ec2-spot-instance-option}} - 
[{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to 
the underlying boto call.

For example:
{code}
spark-ec2 \
...
--ec2-instance-option instance_initiated_shutdown_behavior=terminate \
--ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like 
this will do the trick.


I followed the example of {{ssh}}, which supports multiple extended options 
similarly.

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}


 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick as long as it can be made to pass the options 
 correctly to boto in most cases.
 I followed

[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-08 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6220:

Description: 
There are many EC2 options exposed by the boto library that spark-ec2 uses. 

Over time, many of these EC2 options have been bubbled up here and there to 
become spark-ec2 options.

Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become 
spark-ec2 options.

While major options are well suited to this type of promotion, we should 
probably allow users to pass through EC2 options they want to use through 
spark-ec2 in some generic way.

Let's add two options:
* {{--ec2-instance-option}} - 
[{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
* {{--ec2-spot-instance-option}} - 
[{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to 
the underlying boto call.

For example:
{code}
spark-ec2 \
...
--ec2-instance-option instance_initiated_shutdown_behavior=terminate \
--ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like 
this will do the trick as long as it can be made to pass the options correctly 
to boto in most cases.


I followed the example of {{ssh}}, which supports multiple extended options 
similarly.

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}

  was:
There are many EC2 options exposed by the boto library that spark-ec2 uses. 

Over time, many of these EC2 options have been bubbled up here and there to 
become spark-ec2 options.

Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become 
spark-ec2 options.

While major options are well suited to this type of promotion, we should 
probably allow users to pass through EC2 options they want to use through 
spark-ec2 in some generic way.

Let's add two options:
* {{--ec2-instance-option}} - 
[{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances]
* {{--ec2-spot-instance-option}} - 
[{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to 
the underlying boto call.

For example:
{code}
spark-ec2 \
...
--ec2-instance-option instance_initiated_shutdown_behavior=terminate \
--ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like 
this will do the trick as long as it can be made to pass the options correctly 
to boto in most cases.


I followed the example of {{ssh}}, which supports multiple extended options 
similarly.

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}


 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick as long as it can be made to pass the options 
 correctly to boto in most cases.
 I

[jira] [Created] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse

2015-03-08 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-6218:
---

 Summary: Upgrade spark-ec2 from optparse to argparse
 Key: SPARK-6218
 URL: https://issues.apache.org/jira/browse/SPARK-6218
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor


spark-ec2 [currently uses 
optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43].

In Python 2.7, optparse was [deprecated in favor of 
argparse|https://docs.python.org/2/library/optparse.html]. This is the main 
motivation for moving away from optparse.

Additionally, upgrading to argparse provides some [additional benefits noted in 
the 
docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. 
The one we are most likely to benefit from is the better input validation.

argparse is not included with Python 2.6, which is currently the minimum version 
of Python we support in Spark, but it can easily be downloaded by spark-ec2 
with the work that has already been done in SPARK-6191.
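
As a rough illustration of the kind of per-option validation argparse makes easy 
(the options shown are only examples):

{code}
import argparse

def positive_int(value):
    """argparse 'type' callable: accept only positive integers."""
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError("%s is not a positive integer" % value)
    return number

parser = argparse.ArgumentParser(prog="spark-ec2")
parser.add_argument("--slaves", type=positive_int, default=1,
                    help="number of slaves to launch")
parser.add_argument("--zone", default="",
                    help="availability zone to launch instances in")
args = parser.parse_args()
{code}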



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse

2015-03-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352331#comment-14352331
 ] 

Nicholas Chammas commented on SPARK-6218:
-

[~shivaram], [~joshrosen]: What do you think?

 Upgrade spark-ec2 from optparse to argparse
 ---

 Key: SPARK-6218
 URL: https://issues.apache.org/jira/browse/SPARK-6218
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 spark-ec2 [currently uses 
 optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43].
 In Python 2.7, optparse was [deprecated in favor of 
 argparse|https://docs.python.org/2/library/optparse.html]. This is the main 
 motivation for moving away from optparse.
 Additionally, upgrading to argparse provides some [additional benefits noted 
 in the 
 docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html].
  The one we are most likely to benefit from is the better input validation.
 Specifically, being able to cleanly tie each input parameter to a validation 
 method will cut down the input validation code currently spread out across 
 the script.
 argparse is not included with Python 2.6, which is currently the minimum 
 version of Python we support in Spark, but it can easily be downloaded by 
 spark-ec2 with the work that has already been done in SPARK-6191.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse

2015-03-08 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6218:

Description: 
spark-ec2 [currently uses 
optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43].

In Python 2.7, optparse was [deprecated in favor of 
argparse|https://docs.python.org/2/library/optparse.html]. This is the main 
motivation for moving away from optparse.

Additionally, upgrading to argparse provides some [additional benefits noted in 
the 
docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. 
The one we are most likely to benefit from is the better input validation.

Specifically, being able to cleanly tie each input parameter to a validation 
method will cut down the input validation code currently spread out across the 
script.

argparse is not included with Python 2.6, which is currently the minimum version 
of Python we support in Spark, but it can easily be downloaded by spark-ec2 
with the work that has already been done in SPARK-6191.

  was:
spark-ec2 [currently uses 
optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43].

In Python 2.7, optparse was [deprecated in favor of 
argparse|https://docs.python.org/2/library/optparse.html]. This is the main 
motivation for moving away from optparse.

Additionally, upgrading to argparse provides some [additional benefits noted in 
the 
docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. 
The one we are most likely to benefit from is the better input validation.

argparse is not included with Python 2.6, which is currently the minimum version 
of Python we support in Spark, but it can easily be downloaded by spark-ec2 
with the work that has already been done in SPARK-6191.


 Upgrade spark-ec2 from optparse to argparse
 ---

 Key: SPARK-6218
 URL: https://issues.apache.org/jira/browse/SPARK-6218
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 spark-ec2 [currently uses 
 optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43].
 In Python 2.7, optparse was [deprecated in favor of 
 argparse|https://docs.python.org/2/library/optparse.html]. This is the main 
 motivation for moving away from optparse.
 Additionally, upgrading to argparse provides some [additional benefits noted 
 in the 
 docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html].
  The one we are most likely to benefit from is the better input validation.
 Specifically, being able to cleanly tie each input parameter to a validation 
 method will cut down the input validation code currently spread out across 
 the script.
 argparse is not included with Python 2.6, which is currently the minimum 
 version of Python we support in Spark, but it can easily be downloaded by 
 spark-ec2 with the work that has already been done in SPARK-6191.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-03-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352524#comment-14352524
 ] 

Nicholas Chammas commented on SPARK-6220:
-

As far as places where we create instances, yes, those are the 2 calls we use.

 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick as long as it can be made to pass the options 
 correctly to boto in most cases.
 I followed the example of {{ssh}}, which supports multiple extended options 
 similarly.
 {code}
 ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6219) Expand Python lint checks to check for compilation errors

2015-03-08 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-6219:
---

 Summary: Expand Python lint checks to check for compilation errors
 Key: SPARK-6219
 URL: https://issues.apache.org/jira/browse/SPARK-6219
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas
Priority: Minor


An easy lint check for Python would be to make sure the stuff at least 
compiles. That will catch only the most egregious errors, but it should help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6191) Generalize spark-ec2's ability to download libraries from PyPI

2015-03-05 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6191:

Description: 
Right now we have a method to specifically download boto. Let's generalize it 
so it's easy to download additional libraries if we want.

Likely use cases:
* Downloading PyYAML for 

  was:Right now we have a method to specifically download boto. Let's 
generalize it so it's easy to download additional libraries if we want.


 Generalize spark-ec2's ability to download libraries from PyPI
 --

 Key: SPARK-6191
 URL: https://issues.apache.org/jira/browse/SPARK-6191
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 Right now we have a method to specifically download boto. Let's generalize it 
 so it's easy to download additional libraries if we want.
 Likely use cases:
 * Downloading PyYAML for 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6191) Generalize spark-ec2's ability to download libraries from PyPI

2015-03-05 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-6191:
---

 Summary: Generalize spark-ec2's ability to download libraries from 
PyPI
 Key: SPARK-6191
 URL: https://issues.apache.org/jira/browse/SPARK-6191
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor


Right now we have a method to specifically download boto. Let's generalize it 
so it's easy to download additional libraries if we want.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6191) Generalize spark-ec2's ability to download libraries from PyPI

2015-03-05 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-6191:

Description: 
Right now we have a method to specifically download boto. Let's generalize it 
so it's easy to download additional libraries if we want.

Likely use cases:
* Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML file. 
(SPARK-925)
* Downloading argparse to clean up / modernize our option parsing.

  was:
Right now we have a method to specifically download boto. Let's generalize it 
so it's easy to download additional libraries if we want.

Likely use cases:
* Downloading PyYAML for 


 Generalize spark-ec2's ability to download libraries from PyPI
 --

 Key: SPARK-6191
 URL: https://issues.apache.org/jira/browse/SPARK-6191
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 Right now we have a method to specifically download boto. Let's generalize it 
 so it's easy to download additional libraries if we want.
 Likely use cases:
 * Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML 
 file. (SPARK-925)
 * Downloading argparse to clean up / modernize our option parsing.
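
A sketch of what the generalized helper might look like (hypothetical function; 
the real thing would also want checksum verification and better error handling):

{code}
import os
import sys
import tarfile
import urllib2  # spark_ec2.py currently targets Python 2

def setup_external_lib(name, version, lib_dir="lib"):
    """Download a pure-Python package from PyPI into lib_dir and put it on sys.path."""
    package_dir = os.path.join(lib_dir, "%s-%s" % (name, version))
    if not os.path.isdir(package_dir):
        if not os.path.isdir(lib_dir):
            os.makedirs(lib_dir)
        url = "https://pypi.python.org/packages/source/%s/%s/%s-%s.tar.gz" % (
            name[0], name, name, version)
        tarball_path = package_dir + ".tar.gz"
        with open(tarball_path, "wb") as tarball:
            tarball.write(urllib2.urlopen(url).read())
        tar = tarfile.open(tarball_path)
        tar.extractall(path=lib_dir)
        tar.close()
        os.remove(tarball_path)
    sys.path.insert(1, package_dir)

# e.g. setup_external_lib("boto", "2.34.0")
{code}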



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2015-03-05 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349577#comment-14349577
 ] 

Nicholas Chammas commented on SPARK-3369:
-

{quote}
How about breaking backward compatibility
{quote}

The Spark project has made a big deal out of promising API stability. People 
trust that they can upgrade their version of Spark without breaking any of 
their code.

Breaking this promise would shake users' trust in the project. That's a big 
deal. Overall, it's not worth whatever benefit we hope to get out of fixing 
this issue.

This issue is tagged for 2+ and that seems to be the correct thing to do.

 Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
 Iterator->Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2, 1.2.1
Reporter: Sean Owen
Assignee: Sean Owen
  Labels: breaking_change
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
 an {{Iterator}} but is required to return an {{Iterable}}, which is a 
 stronger condition and appears inconsistent. It's a problematic inconsistency, 
 though, because this seems to require copying all of the input into memory in 
 order to create an object that can be iterated many times, since the input 
 does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other 
 {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6193) Speed up how spark-ec2 searches for clusters

2015-03-05 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-6193:
---

 Summary: Speed up how spark-ec2 searches for clusters
 Key: SPARK-6193
 URL: https://issues.apache.org/jira/browse/SPARK-6193
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor


{{spark-ec2}} currently pulls down [info for all 
instances|https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620]
 and searches locally for the target cluster. Instead, it should push those 
filters up when querying EC2.

For AWS accounts with hundreds of active instances, there is a difference of 
many seconds between the two.
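
For example, with boto (which spark-ec2 already uses), the filtering can be 
pushed into the EC2 API call itself. A rough sketch, assuming boto 2 and the 
{{<cluster>-master}} / {{<cluster>-slaves}} security-group naming convention:

{code}
# Sketch: ask EC2 for only the cluster's instances instead of all of them.
import boto.ec2

def get_cluster_instances(region, cluster_name):
    conn = boto.ec2.connect_to_region(region)
    reservations = conn.get_all_instances(
        filters={"instance.group-name": [cluster_name + "-master",
                                         cluster_name + "-slaves"]})
    return [inst for res in reservations for inst in res.instances]
{code}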



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5473) Expose SSH failures after status checks pass

2015-03-04 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5473:

Description: 
If there is some fatal problem with launching a cluster, `spark-ec2` just hangs 
without giving the user useful feedback on what the problem is.

This PR exposes the output of the SSH calls to the user if the SSH test fails 
during cluster launch for any reason but the instance status checks are all 
green.

For example:

```
$ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type 
m3.medium --slaves 1 --zone us-east-1c launch spark-test
Setting up security groups...
Searching for existing cluster spark-test...
Spark AMI: ami-35b1885c
Launching instances...
Launched 1 slaves in us-east-1c, regid = r-7dadd096
Launched master in us-east-1c, regid = r-fcadd017
Waiting for cluster to enter 'ssh-ready' state...
Warning: SSH connection error. (This could be temporary.)
Host: 127.0.0.1
SSH return code: 255
SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: 
No such file or directory.
Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
Permission denied (publickey).
```

This should give users enough information when some unrecoverable error occurs 
during launch so they can know to abort the launch. This will help avoid 
situations like the ones reported [here on Stack 
Overflow](http://stackoverflow.com/q/28002443/) and [here on the user 
list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3c1422323829398-21381.p...@n3.nabble.com%3E),
 where the users couldn't tell what the problem was because it was being hidden 
by `spark-ec2`.

This is a usability improvement that should be backported to 1.2.
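
The change itself is mostly about capturing the SSH probe's output and printing 
it when the probe fails. A minimal sketch of that idea (the helper passed in 
stands in for however `spark-ec2` builds its ssh argument list; it is not the 
actual code):

```
# Sketch: run the SSH probe and surface its output if it fails.
import subprocess

def is_ssh_available(host, build_ssh_command):
    cmd = build_ssh_command(host) + ["true"]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output, _ = proc.communicate()
    if proc.returncode != 0:
        print("Warning: SSH connection error. (This could be temporary.)")
        print("Host: " + host)
        print("SSH return code: " + str(proc.returncode))
        print("SSH output: " + output.decode("utf-8", "replace").strip())
    return proc.returncode == 0
```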

 Expose SSH failures after status checks pass
 

 Key: SPARK-5473
 URL: https://issues.apache.org/jira/browse/SPARK-5473
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.3.0


 If there is some fatal problem with launching a cluster, `spark-ec2` just 
 hangs without giving the user useful feedback on what the problem is.
 This PR exposes the output of the SSH calls to the user if the SSH test fails 
 during cluster launch for any reason but the instance status checks are all 
 green.
 For example:
 ```
 $ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type 
 m3.medium --slaves 1 --zone us-east-1c launch spark-test
 Setting up security groups...
 Searching for existing cluster spark-test...
 Spark AMI: ami-35b1885c
 Launching instances...
 Launched 1 slaves in us-east-1c, regid = r-7dadd096
 Launched master in us-east-1c, regid = r-fcadd017
 Waiting for cluster to enter 'ssh-ready' state...
 Warning: SSH connection error. (This could be temporary.)
 Host: 127.0.0.1
 SSH return code: 255
 SSH output: Warning: Identity file /incorrect/path/identity.pem not 
 accessible: No such file or directory.
 Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts.
 Permission denied (publickey).
 ```
 This should give users enough information when some unrecoverable error 
 occurs during launch so they can know to abort the launch. This will help 
 avoid situations like the ones reported [here on Stack 
 Overflow](http://stackoverflow.com/q/28002443/) and [here on the user 
 list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3c1422323829398-21381.p...@n3.nabble.com%3E),
  where the users couldn't tell what the problem was because it was being 
 hidden by `spark-ec2`.
 This is a usability improvement that should be backported to 1.2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2015-03-04 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3533:

Target Version/s: 1.4.0

 Add saveAsTextFileByKey() method to RDDs
 

 Key: SPARK-3533
 URL: https://issues.apache.org/jira/browse/SPARK-3533
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 Users often have a single RDD of key-value pairs that they want to save to 
 multiple locations based on the keys.
 For example, say I have an RDD like this:
 {code}
  >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
  ... 'Frankie']).keyBy(lambda x: x[0])
  >>> a.collect()
  [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
  >>> a.keys().distinct().collect()
  ['B', 'F', 'N']
 {code}
 Now I want to write the RDD out to different paths depending on the keys, so 
 that I have one output directory per distinct key. Each output directory 
 could potentially have multiple {{part-}} files, one per RDD partition.
 So the output would look something like:
 {code}
 /path/prefix/B [/part-1, /part-2, etc]
 /path/prefix/F [/part-1, /part-2, etc]
 /path/prefix/N [/part-1, /part-2, etc]
 {code}
 Though it may be possible to do this with some combination of 
 {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
 {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
 It's not clear if it's even possible at all in PySpark.
 Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
 that makes it easy to save RDDs out to multiple locations at once.
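 In the meantime, a workaround is one filtered pass per distinct key, which is 
 only practical when there are few keys. A minimal PySpark sketch:
 {code}
 # Workaround sketch: one full pass over the RDD per distinct key.
 def save_as_text_file_by_key(rdd, path_prefix):
     for key in rdd.keys().distinct().collect():
         (rdd.filter(lambda kv, k=key: kv[0] == k)
             .values()
             .saveAsTextFile("{0}/{1}".format(path_prefix, key)))
 {code}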



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: spark-ec2 default to Hadoop 2

2015-03-02 Thread Nicholas Chammas
I might take a look at that pr if we get around to doing some perf testing
of Spark on various resource managers.

On Mon, Mar 2, 2015 at 12:22 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu
wrote:

FWIW there is a PR open to add support for Hadoop 2.4 to spark-ec2 scripts
 at https://github.com/mesos/spark-ec2/pull/77 -- But it hasn't received
 much review or testing to be merged.

 Thanks
 Shivaram

 On Sun, Mar 1, 2015 at 11:49 PM, Sean Owen so...@cloudera.com wrote:

 I agree with that. My anecdotal impression is that Hadoop 1.x usage
 out there is maybe a couple percent, and so we should shift towards
 2.x at least as defaults.

 On Sun, Mar 1, 2015 at 10:59 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  https://github.com/apache/spark/blob/fd8d283eeb98e310b1e85ef8c3a8af
 9e547ab5e0/ec2/spark_ec2.py#L162-L164
 
  Is there any reason we shouldn't update the default Hadoop major
 version in
  spark-ec2 to 2?
 
  Nick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





[jira] [Commented] (SPARK-882) Have link for feedback/suggestions in docs

2015-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344475#comment-14344475
 ] 

Nicholas Chammas commented on SPARK-882:


Is the intended use here that users could submit corrections easily without 
having to open a JIRA/PR? I think that's a great idea; it lowers the barrier to 
providing feedback on a high visibility item like the docs.

Couple of questions:

1. Is integration with 3rd party tools like UserVoice or Disqus allowed? 
Actually, it might be really sweet if some simple, in-page feedback form 
automatically submitted a JIRA issue with the appropriate tags and info.

2. I assume the docs proper are the priority, right? Do we want to do this for 
the main site as well?
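
On the auto-file-a-JIRA idea from question 1: the submission side is easy since 
JIRA has a standard REST endpoint for creating issues; the work would be in the 
docs-side form and in spam control. A rough sketch of the submission call 
(field values are illustrative, and authentication is glossed over):

{code}
# Sketch: create a docs-feedback issue via JIRA's REST create-issue endpoint.
import json
import urllib2

def file_docs_feedback(summary, description):
    payload = {"fields": {
        "project": {"key": "SPARK"},
        "issuetype": {"name": "Documentation"},
        "summary": summary,
        "description": description,
        "labels": ["docs-feedback"],
    }}
    req = urllib2.Request("https://issues.apache.org/jira/rest/api/2/issue",
                          data=json.dumps(payload),
                          headers={"Content-Type": "application/json"})
    return json.loads(urllib2.urlopen(req).read())
{code}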

 Have link for feedback/suggestions in docs
 --

 Key: SPARK-882
 URL: https://issues.apache.org/jira/browse/SPARK-882
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Assignee: Patrick Cogan

 It would be cool to have a link at the top of the docs for 
 feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from 
 that and it could be a good way to crowdsource correctness checking, since a 
 lot of us that write them never have to use them.
 Something to the right of the main top nav might be good. [~andyk] [~matei] - 
 what do you guys think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2545) Add a diagnosis mode for closures to figure out what they're bringing in

2015-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344482#comment-14344482
 ] 

Nicholas Chammas commented on SPARK-2545:
-

[~adav] - Would this potentially also be something to use in the REPL? If I 
understand correctly, the situation with closures is more complicated there, 
right?

 Add a diagnosis mode for closures to figure out what they're bringing in
 

 Key: SPARK-2545
 URL: https://issues.apache.org/jira/browse/SPARK-2545
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Aaron Davidson

 Today, it's pretty hard to figure out why your closure is bigger than 
 expected, because it's not obvious what objects are being included or who is 
 including them. We should have some sort of diagnosis available to users with 
 very large closures that displays the contents of the closure.
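 The real diagnosis would have to happen on the JVM side (around 
 ClosureCleaner), but as a rough, language-agnostic illustration of what it 
 could report, here is the Python analogue of listing a closure's captured 
 variables and their serialized sizes:
 {code}
 # Illustration only: "show me what this closure captured", sketched in Python.
 import pickle

 def describe_closure(func):
     report = {}
     cells = func.__closure__ or ()
     for name, cell in zip(func.__code__.co_freevars, cells):
         value = cell.cell_contents
         try:
             size = len(pickle.dumps(value))
         except Exception:
             size = None  # not picklable
         report[name] = (type(value).__name__, size)
     return report
 {code}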



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2545) Add a diagnosis mode for closures to figure out what they're bringing in

2015-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344504#comment-14344504
 ] 

Nicholas Chammas commented on SPARK-2545:
-

cc [~tobias.schlatter]

 Add a diagnosis mode for closures to figure out what they're bringing in
 

 Key: SPARK-2545
 URL: https://issues.apache.org/jira/browse/SPARK-2545
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Aaron Davidson

 Today, it's pretty hard to figure out why your closure is bigger than 
 expected, because it's not obvious what objects are being included or who is 
 including them. We should have some sort of diagnosis available to users with 
 very large closures that displays the contents of the closure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2095) sc.getExecutorCPUCounts()

2015-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344480#comment-14344480
 ] 

Nicholas Chammas commented on SPARK-2095:
-

cc [~pwendell], [~joshrosen]

This seems like a useful thing to have, though you can accomplish something 
similar (if not as explicitly) with {{sc.defaultParallelism}}, which 
defaults to the total number of executor cores in your cluster.
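
For example, in PySpark (the input path and multiplier are arbitrary):

{code}
# defaultParallelism roughly equals the total executor cores on most
# cluster managers, so it can drive partition counts.
from pyspark import SparkContext

sc = SparkContext(appName="partition-sizing-example")
num_cores = sc.defaultParallelism
rdd = sc.textFile("hdfs:///path/to/input", minPartitions=num_cores * 2)
{code}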

 sc.getExecutorCPUCounts()
 -

 Key: SPARK-2095
 URL: https://issues.apache.org/jira/browse/SPARK-2095
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Daniel Darabos
Priority: Minor

 We can get the amount of total and free memory (via getExecutorMemoryStatus) 
 and blocks stored (via getExecutorStorageStatus) on the executors. I would 
 also like to be able to query the available CPU per executor. This would be 
 useful in dynamically deciding the number of partitions at the start of an 
 operation. What do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



spark-ec2 default to Hadoop 2

2015-03-01 Thread Nicholas Chammas
https://github.com/apache/spark/blob/fd8d283eeb98e310b1e85ef8c3a8af9e547ab5e0/ec2/spark_ec2.py#L162-L164

Is there any reason we shouldn't update the default Hadoop major version in
spark-ec2 to 2?

Nick


[jira] [Commented] (SPARK-6077) Multiple spark streaming tabs on UI when reuse the same sparkcontext

2015-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342704#comment-14342704
 ] 

Nicholas Chammas commented on SPARK-6077:
-

Please disregard the comments on SPARK-2463 and focus on the description. The 
comments veer off into a separate issue from the one put forward in the 
description.

 Multiple spark streaming tabs on UI when reuse the same sparkcontext
 

 Key: SPARK-6077
 URL: https://issues.apache.org/jira/browse/SPARK-6077
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Reporter: zhichao-li
Priority: Minor

 Currently we would create a new streaming tab for each streamingContext even 
 if there's already one on the same sparkContext which would cause duplicate 
 StreamingTab created and none of them is taking effect. 
 snapshot: 
 https://www.dropbox.com/s/t4gd6hqyqo0nivz/bad%20multiple%20streamings.png?dl=0
 How to reproduce:
 1)
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.storage.StorageLevel
 val ssc = new StreamingContext(sc, Seconds(1))
 val lines = ssc.socketTextStream("localhost", , 
 StorageLevel.MEMORY_AND_DISK_SER)
 val words = lines.flatMap(_.split(" "))
 val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
 wordCounts.print()
 ssc.start()
 .
 2)
 ssc.stop(false)
 val ssc = new StreamingContext(sc, Seconds(1))
 val lines = ssc.socketTextStream("localhost", , 
 StorageLevel.MEMORY_AND_DISK_SER)
 val words = lines.flatMap(_.split(" "))
 val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
 wordCounts.print()
 ssc.start()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2463) Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI

2015-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342714#comment-14342714
 ] 

Nicholas Chammas commented on SPARK-2463:
-

For people reading through these comments, please keep in mind that this issue 
is describing a problem relating to starting and then stopping a streaming 
context multiple times. There is only ever 1 context running at a time.

*This issue has nothing to do with concurrently running contexts*, at least not 
directly.

 Creating then stopping StreamingContext multiple times from shell generates 
 duplicate Streaming tabs in UI
 --

 Key: SPARK-2463
 URL: https://issues.apache.org/jira/browse/SPARK-2463
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.0.1
Reporter: Nicholas Chammas
Assignee: Josh Rosen

 Start a {{StreamingContext}} from the interactive shell and then stop it. Go 
 to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for 
 Streaming.
 Now from the same shell, create and start a new {{StreamingContext}}. There 
 will now be a duplicate tab for Streaming in the UI. Repeating this process 
 generates additional Streaming tabs. 
 They all link to the same information.
 *Please note* that the issue of concurrently running contexts discussed in 
 the comments below is a completely separate issue.
 *This issue has nothing to do with concurrently running contexts.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2463) Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI

2015-03-01 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-2463:

Description: 
Start a {{StreamingContext}} from the interactive shell and then stop it. Go to 
{{http://master_url:4040/streaming/}} and you will see a tab in the UI for 
Streaming.

Now from the same shell, create and start a new {{StreamingContext}} (and then 
stop it, if you want). There will now be a duplicate tab for Streaming in the 
UI. Repeating this process generates additional Streaming tabs. 

They all link to the same information.

*Please note* that the issue of concurrently running contexts discussed in the 
comments below is a completely separate issue.

*This issue has nothing to do with concurrently running streaming contexts.*

  was:
Start a {{StreamingContext}} from the interactive shell and then stop it. Go to 
{{http://master_url:4040/streaming/}} and you will see a tab in the UI for 
Streaming.

Now from the same shell, create and start a new {{StreamingContext}}. There 
will now be a duplicate tab for Streaming in the UI. Repeating this process 
generates additional Streaming tabs. 

They all link to the same information.

*Please note* that the issue of concurrently running contexts discussed in the 
comments below is a completely separate issue.

*This issue has nothing to do with concurrently running contexts.*


 Creating then stopping StreamingContext multiple times from shell generates 
 duplicate Streaming tabs in UI
 --

 Key: SPARK-2463
 URL: https://issues.apache.org/jira/browse/SPARK-2463
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.0.1
Reporter: Nicholas Chammas
Assignee: Josh Rosen

 Start a {{StreamingContext}} from the interactive shell and then stop it. Go 
 to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for 
 Streaming.
 Now from the same shell, create and start a new {{StreamingContext}} (and 
 then stop it, if you want). There will now be a duplicate tab for Streaming 
 in the UI. Repeating this process generates additional Streaming tabs. 
 They all link to the same information.
 *Please note* that the issue of concurrently running contexts discussed in 
 the comments below is a completely separate issue.
 *This issue has nothing to do with concurrently running streaming contexts.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2463) Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI

2015-03-01 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-2463:

Description: 
Start a {{StreamingContext}} from the interactive shell and then stop it. Go to 
{{http://master_url:4040/streaming/}} and you will see a tab in the UI for 
Streaming.

Now from the same shell, create and start a new {{StreamingContext}}. There 
will now be a duplicate tab for Streaming in the UI. Repeating this process 
generates additional Streaming tabs. 

They all link to the same information.

*Please note* that the issue of concurrently running contexts discussed in the 
comments below is a completely separate issue.

*This issue has nothing to do with concurrently running contexts.*

  was:
Start a {{StreamingContext}} from the interactive shell and then stop it. Go to 
{{http://master_url:4040/streaming/}} and you will see a tab in the UI for 
Streaming.

Now from the same shell, create and start a new {{StreamingContext}}. There 
will now be a duplicate tab for Streaming in the UI. Repeating this process 
generates additional Streaming tabs. 

They all link to the same information.


 Creating then stopping StreamingContext multiple times from shell generates 
 duplicate Streaming tabs in UI
 --

 Key: SPARK-2463
 URL: https://issues.apache.org/jira/browse/SPARK-2463
 Project: Spark
  Issue Type: Bug
  Components: Streaming, Web UI
Affects Versions: 1.0.1
Reporter: Nicholas Chammas
Assignee: Josh Rosen

 Start a {{StreamingContext}} from the interactive shell and then stop it. Go 
 to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for 
 Streaming.
 Now from the same shell, create and start a new {{StreamingContext}}. There 
 will now be a duplicate tab for Streaming in the UI. Repeating this process 
 generates additional Streaming tabs. 
 They all link to the same information.
 *Please note* that the issue of concurrently running contexts discussed in 
 the comments below is a completely separate issue.
 *This issue has nothing to do with concurrently running contexts.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6084) spark-shell broken on Windows

2015-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341787#comment-14341787
 ] 

Nicholas Chammas commented on SPARK-6084:
-

Ah, there's also SPARK-5396, though it's in Russian (?) so I'm not sure if the 
error is the same.

 spark-shell broken on Windows
 -

 Key: SPARK-6084
 URL: https://issues.apache.org/jira/browse/SPARK-6084
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0, 1.2.1
 Environment: Windows 7, Scala 2.11.4, Java 1.8
Reporter: Nicholas Chammas
  Labels: windows

 Original report here: 
 http://stackoverflow.com/questions/28747795/spark-launch-find-version
 For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this:
 {code}
 bin\spark-shell.cmd
 {code}
 Yields the following error:
 {code}
 find: 'version': No such file or directory
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341789#comment-14341789
 ] 

Nicholas Chammas commented on SPARK-5389:
-

Yeah, I think we found another instance of this in SPARK-6084 / 
[here|http://stackoverflow.com/questions/28747795/spark-launch-find-version].

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
Reporter: Yana Kadiyska
Priority: Trivial
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 spark-1.2.0-bin-cdh4bin\spark-shell.cmd
 else was unexpected at this time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6084) spark-shell broken on Windows

2015-02-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-6084:
-

Don't see how this is a dup of SPARK-4833.

 spark-shell broken on Windows
 -

 Key: SPARK-6084
 URL: https://issues.apache.org/jira/browse/SPARK-6084
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0, 1.2.1
 Environment: Windows 7, Scala 2.11.4, Java 1.8
Reporter: Nicholas Chammas
  Labels: windows

 Original report here: 
 http://stackoverflow.com/questions/28747795/spark-launch-find-version
 For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this:
 {code}
 bin\spark-shell.cmd
 {code}
 Yields the following error:
 {code}
 find: 'version': No such file or directory
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6084) spark-shell broken on Windows

2015-02-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-6084.
-
Resolution: Duplicate

Resolving as duplicate of SPARK-5389. That seems a more likely match for this 
than SPARK-4833.

 spark-shell broken on Windows
 -

 Key: SPARK-6084
 URL: https://issues.apache.org/jira/browse/SPARK-6084
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0, 1.2.1
 Environment: Windows 7, Scala 2.11.4, Java 1.8
Reporter: Nicholas Chammas
  Labels: windows

 Original report here: 
 http://stackoverflow.com/questions/28747795/spark-launch-find-version
 For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this:
 {code}
 bin\spark-shell.cmd
 {code}
 Yields the following error:
 {code}
 find: 'version': No such file or directory
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5396) Syntax error in spark scripts on windows.

2015-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341788#comment-14341788
 ] 

Nicholas Chammas commented on SPARK-5396:
-

What does that error message say in English? So we can pattern match to similar 
reports elsewhere.

 Syntax error in spark scripts on windows.
 -

 Key: SPARK-5396
 URL: https://issues.apache.org/jira/browse/SPARK-5396
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0
 Environment: Window 7 and Window 8.1.
Reporter: Vladimir Protsenko
Assignee: Masayoshi TSUZUKI
Priority: Critical
 Fix For: 1.3.0

 Attachments: windows7.png, windows8.1.png


 I made the following steps: 
 1. downloaded and installed Scala 2.11.5 
 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 
 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean 
 package (in git bash) 
 After installation I tried to run spark-shell.cmd in a cmd shell and it says 
 there is a syntax error in the file. The same happens with spark-shell2.cmd, 
 spark-submit.cmd, and spark-submit2.cmd.
 !windows7.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-02-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5389:

Description: 
spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 

spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2

Marking as trivial since calling spark-shell2.cmd also works fine

Attaching a screenshot since the error isn't very useful:

{code}
spark-1.2.0-bin-cdh4bin\spark-shell.cmd
else was unexpected at this time.
{code}

  was:
spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 

spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2

Marking as trivial sine calling spark-shell2.cmd also works fine

Attaching a screenshot since the error isn't very useful:

spark-1.2.0-bin-cdh4bin\spark-shell.cmd
else was unexpected at this time.

   Priority: Major  (was: Trivial)
Environment: Windows 7

Marking as major since the shell is technically broken. (Trivial is for mostly 
cosmetic problems.)

Reopening since multiple reports of this problem have come in.

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-02-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-5389:
-

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341790#comment-14341790
 ] 

Nicholas Chammas edited comment on SPARK-5389 at 2/28/15 9:48 PM:
--

Marking as major since the shell -is technically broken- is behaving terribly 
when Java cannot be found.

Reopening since multiple reports of this problem have come in.


was (Author: nchammas):
Marking as major since the shell is technically broken. (Trivial is for mostly 
cosmetic problems.)

Reopening since multiple reports of this problem have come in.

 spark-shell.cmd does not run from DOS Windows 7
 ---

 Key: SPARK-5389
 URL: https://issues.apache.org/jira/browse/SPARK-5389
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0
 Environment: Windows 7
Reporter: Yana Kadiyska
 Attachments: SparkShell_Win7.JPG


 spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
 spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
 Marking as trivial since calling spark-shell2.cmd also works fine
 Attaching a screenshot since the error isn't very useful:
 {code}
 spark-1.2.0-bin-cdh4bin\spark-shell.cmd
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6084) spark-shell broken on Windows

2015-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341776#comment-14341776
 ] 

Nicholas Chammas commented on SPARK-6084:
-

I took a look at the linked issue (SPARK-4833) and I don't see how they are 
duplicates. They both relate to spark-shell and Windows, but the error messages 
and conditions are different.

Here the user is claiming spark-shell fails with an error right away. There, the 
user is claiming spark-shell runs OK the first time, but then doesn't run a 
second time.

 spark-shell broken on Windows
 -

 Key: SPARK-6084
 URL: https://issues.apache.org/jira/browse/SPARK-6084
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0, 1.2.1
 Environment: Windows 7, Scala 2.11.4, Java 1.8
Reporter: Nicholas Chammas
  Labels: windows

 Original report here: 
 http://stackoverflow.com/questions/28747795/spark-launch-find-version
 For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this:
 {code}
 bin\spark-shell.cmd
 {code}
 Yields the following error:
 {code}
 find: 'version': No such file or directory
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6084) spark-shell broken on Windows

2015-02-28 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-6084:
---

 Summary: spark-shell broken on Windows
 Key: SPARK-6084
 URL: https://issues.apache.org/jira/browse/SPARK-6084
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.1, 1.2.0
 Environment: Windows 7, Scala 2.11.4, Java 1.8
Reporter: Nicholas Chammas


Original report here: 
http://stackoverflow.com/questions/28747795/spark-launch-find-version

For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this:

{code}
bin\spark-shell.cmd
{code}

Yields the following error:

{code}
find: 'version': No such file or directory
else was unexpected at this time.
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6084) spark-shell broken on Windows

2015-02-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341746#comment-14341746
 ] 

Nicholas Chammas commented on SPARK-6084:
-

cc [~pwendell], [~andrewor14]

I haven't confirmed this issue myself. Just forwarding along the report I saw 
on Stack Overflow.

 spark-shell broken on Windows
 -

 Key: SPARK-6084
 URL: https://issues.apache.org/jira/browse/SPARK-6084
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.0, 1.2.1
 Environment: Windows 7, Scala 2.11.4, Java 1.8
Reporter: Nicholas Chammas
  Labels: windows

 Original report here: 
 http://stackoverflow.com/questions/28747795/spark-launch-find-version
 For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this:
 {code}
 bin\spark-shell.cmd
 {code}
 Yields the following error:
 {code}
 find: 'version': No such file or directory
 else was unexpected at this time.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5971) Add Mesos support to spark-ec2

2015-02-24 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5971:

Description: 
Right now, spark-ec2 can only launch Spark clusters that use the standalone 
manager.

Adding support for Mesos would be useful mostly for automated performance 
testing of Spark on Mesos.

  was:
Right now, spark-ec2 can only launching Spark clusters that use the standalone 
manager.

Adding support to launch Spark-on-Mesos clusters would be useful mostly for 
automated performance testing of Spark on Mesos.


 Add Mesos support to spark-ec2
 --

 Key: SPARK-5971
 URL: https://issues.apache.org/jira/browse/SPARK-5971
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 Right now, spark-ec2 can only launch Spark clusters that use the standalone 
 manager.
 Adding support for Mesos would be useful mostly for automated performance 
 testing of Spark on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335074#comment-14335074
 ] 

Nicholas Chammas commented on SPARK-3850:
-

Ah I see. I'm fine with closing this issue if that's the case. I opened it 
mostly because of the linked discussions. But actually wouldn't this check also 
cover those data files?

 Scala style: disallow trailing spaces
 -

 Key: SPARK-3850
 URL: https://issues.apache.org/jira/browse/SPARK-3850
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 Background discussions:
 * https://github.com/apache/spark/pull/2619
 * 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
 If you look at [the PR Cheng 
 opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
 white space seemed to mess up some SQL test. That's what spurred the creation 
 of this issue.
 [Ted Yu on the dev 
 list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
  suggested using this 
 [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-24 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3850:

Description: 
Background discussions:
* https://github.com/apache/spark/pull/2619
* 
http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html

If you look at [the PR Cheng opened|https://github.com/apache/spark/pull/2619], 
you'll see a trailing white space seemed to mess up some SQL test. That's what 
spurred the creation of this issue.

[Ted Yu on the dev 
list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
 suggested using this 
[{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].

  was:[Ted Yu on the dev 
list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
 suggested using {{WhitespaceEndOfLineChecker}} here: 
http://www.scalastyle.org/rules-0.1.0.html

   Priority: Minor  (was: Major)

 Scala style: disallow trailing spaces
 -

 Key: SPARK-3850
 URL: https://issues.apache.org/jira/browse/SPARK-3850
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 Background discussions:
 * https://github.com/apache/spark/pull/2619
 * 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html
 If you look at [the PR Cheng 
 opened|https://github.com/apache/spark/pull/2619], you'll see a trailing 
 white space seemed to mess up some SQL test. That's what spurred the creation 
 of this issue.
 [Ted Yu on the dev 
 list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
  suggested using this 
 [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335045#comment-14335045
 ] 

Nicholas Chammas commented on SPARK-3850:
-

I guess the root is the [Style 
Guide|https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide].
 This and the parent issue are simply meant for automating existing rules, not 
introducing new ones. 

As an aside, this particular rule doesn't seem to be mentioned in the style 
guide, but it was discussed in a couple of places: 
* https://github.com/apache/spark/pull/2619
* 
http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html

If you look at the PR [~lian cheng] opened, you'll see a trailing white space 
seemed to mess up some SQL test.

 Scala style: disallow trailing spaces
 -

 Key: SPARK-3850
 URL: https://issues.apache.org/jira/browse/SPARK-3850
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Nicholas Chammas

 [Ted Yu on the dev 
 list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
  suggested using {{WhitespaceEndOfLineChecker}} here: 
 http://www.scalastyle.org/rules-0.1.0.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5971) Add Mesos support to spark-ec2

2015-02-24 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5971:

Summary: Add Mesos support to spark-ec2  (was: Add support for launching 
Spark-on-Mesos clusters to spark-ec2 )

 Add Mesos support to spark-ec2
 --

 Key: SPARK-5971
 URL: https://issues.apache.org/jira/browse/SPARK-5971
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 Right now, spark-ec2 can only launch Spark clusters that use the 
 standalone manager.
 Adding support to launch Spark-on-Mesos clusters would be useful mostly for 
 automated performance testing of Spark on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5971) Add support for launching Spark-on-Mesos clusters to spark-ec2

2015-02-24 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-5971:
---

 Summary: Add support for launching Spark-on-Mesos clusters to 
spark-ec2 
 Key: SPARK-5971
 URL: https://issues.apache.org/jira/browse/SPARK-5971
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor


Right now, spark-ec2 can only launch Spark clusters that use the standalone 
manager.

Adding support to launch Spark-on-Mesos clusters would be useful mostly for 
automated performance testing of Spark on Mesos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3674) Add support for launching YARN clusters in spark-ec2

2015-02-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335199#comment-14335199
 ] 

Nicholas Chammas commented on SPARK-3674:
-

There is an open PR for this here: https://github.com/mesos/spark-ec2/pull/77

 Add support for launching YARN clusters in spark-ec2
 

 Key: SPARK-3674
 URL: https://issues.apache.org/jira/browse/SPARK-3674
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman

 Right now spark-ec2 only supports launching Spark Standalone clusters. While 
 this is sufficient for basic usage it is hard to test features or do 
 performance benchmarking on YARN. It will be good to add support for 
 installing, configuring a Apache YARN cluster at a fixed version -- say the 
 latest stable version 2.4.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-02-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335573#comment-14335573
 ] 

Nicholas Chammas commented on SPARK-5312:
-

It's something to consider I guess. Spark provides strong guarantees about API 
stability and the like. 

Making it easy for reviewers to catch changes to public classes is supposed to 
help with that. What we have is perhaps good for now, and perhaps for the 
foreseeable future.

So maybe we should resolve this issue for now and just keep it in mind in the 
future.

cc [~pwendell]

 Use sbt to detect new or changed public classes in PRs
 --

 Key: SPARK-5312
 URL: https://issues.apache.org/jira/browse/SPARK-5312
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 We currently use an [unwieldy grep/sed 
 contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
  to detect new public classes in PRs.
 Apparently, sbt lets you get a list of public classes [much more 
 directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
 {{show compile:discoveredMainClasses}}. We should use that instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Posting to the list

2015-02-23 Thread Nicholas Chammas
Nabble is a third-party site. If you send stuff through Nabble, Nabble has
to forward it along to the Apache mailing list. If something goes wrong
with that, you will have a message show up on Nabble that no-one saw.

The reverse can also happen, where something actually goes out on the list
and doesn't make it to Nabble.

Nabble is a nicer, third-party interface to the Apache list archives. No
more. It works best for reading through old threads.

Apache is the source of truth. Post through there.

Unfortunately, this is what we're stuck with. For a related
discussion, see this
thread about Discourse
http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-td20851.html
.

Nick

On Sun Feb 22 2015 at 8:07:08 PM haihar nahak harihar1...@gmail.com wrote:

 I checked it but I didn't see any mail from user list. Let me do it one
 more time.

 [image: Inline image 1]

 --Harihar

 On Mon, Feb 23, 2015 at 11:50 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. i didnt get any new subscription mail in my inbox.

 Have you checked your Spam folder ?

 Cheers

 On Sun, Feb 22, 2015 at 2:36 PM, hnahak harihar1...@gmail.com wrote:

 I'm also facing the same issue, this is third time whenever I post
 anything
 it never accept by the community and at the same time got a failure mail
 in
 my register mail id.

 and when click to subscribe to this mailing list link, i didnt get any
 new
 subscription mail in my inbox.

 Please anyone suggest a best way to subscribed the email ID



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 --
 {{{H2N}}}-(@:



Re: Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-23 Thread Nicholas Chammas
I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
but still, it should work…

Nope, it shouldn’t, unfortunately. The Spark base AMIs are custom-built for
spark-ec2. No other AMI will work unless it was built with that goal in
mind. Using a random AMI from the Amazon marketplace is unlikely to work
because there are several tools and packages (e.g. like git) that need to
be on the AMI.

Furthermore, the spark-ec2 scripts all assume a yum-based Linux
distribution, so you won’t be able to use Ubuntu (an apt-get-based distro)
without some significant changes to the shell scripts used to build the AMI.

There is some work ongoing as part of SPARK-3821
https://issues.apache.org/jira/browse/SPARK-3821 to make it easier to
generate AMIs that work with spark-ec2.

Nick

On Sun Feb 22 2015 at 7:42:52 PM Ted Yu yuzhih...@gmail.com wrote:

 bq. bash: git: command not found

 Looks like the AMI doesn't have git pre-installed.

 Cheers

 On Sun, Feb 22, 2015 at 4:29 PM, olegshirokikh o...@solver.com wrote:

 I'm trying to launch Spark cluster on AWS EC2 with custom AMI (Ubuntu)
 using
 the following:

 ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem'
 --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2
 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch
 spark-ubuntu-cluster

 Everything starts OK and instances are launched:

 Found 1 master(s), 2 slaves
 Waiting for all instances in cluster to enter 'ssh-ready' state.
 Generating cluster's SSH key on master.

 But then I'm getting the following SSH errors until it stops trying and
 quits:

 bash: git: command not found
 Connection to ***.us-west-2.compute.amazonaws.com closed.
 Error executing remote command, retrying after 30 seconds: Command
 '['ssh',
 '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o',
 'UserKnownHostsFile=/dev/null', '-t', '-t',
 u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 && git
 clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero
 exit
 status 127

 I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
 but still, it should work... Any advice would be greatly appreciated!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





[jira] [Commented] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333496#comment-14333496
 ] 

Nicholas Chammas commented on SPARK-5944:
-

I'm not sure, but I think [here in the root 
POM|https://github.com/apache/spark/blob/242d49584c6aa21d928db2552033661950f760a5/pom.xml#L29]
 is where you can programmatically fetch the release version. (cc [~srowen] for 
verification)
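
For example, a few lines of Python could read the version straight out of the 
root POM at doc-build time (where exactly this would hook into the release 
scripts is an open question):

{code}
# Read the release version out of the root pom.xml.
import xml.etree.ElementTree as ET

POM_NS = "{http://maven.apache.org/POM/4.0.0}"

def spark_version(pom_path="pom.xml"):
    root = ET.parse(pom_path).getroot()
    return root.find(POM_NS + "version").text
{code}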

Also, we should update the [release 
checklist|https://cwiki.apache.org/confluence/display/SPARK/Preparing+Spark+Releases#PreparingSparkReleases-PreparingSparkforRelease]
 so this isn't missed again.

Maybe this is something that goes in [this audit 
script|https://github.com/apache/spark/blob/master/dev/audit-release/audit_release.py]?
 (cc [~pwendell])

 Python release docs say SNAPSHOT + Author is missing
 

 Key: SPARK-5944
 URL: https://issues.apache.org/jira/browse/SPARK-5944
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.1
Reporter: Nicholas Chammas
Priority: Minor

 http://spark.apache.org/docs/latest/api/python/index.html
 As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 
 1.2.1.
 Furthermore, in the footer it says Copyright 2014, Author. It should 
 probably say something something else or be removed altogether.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5944:

Target Version/s: 1.2.2

 Python release docs say SNAPSHOT + Author is missing
 

 Key: SPARK-5944
 URL: https://issues.apache.org/jira/browse/SPARK-5944
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.1
Reporter: Nicholas Chammas
Priority: Minor

 http://spark.apache.org/docs/latest/api/python/index.html
 As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 
 1.2.1.
 Furthermore, in the footer it says Copyright 2014, Author. It should 
 probably say something something else or be removed altogether.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 - 2.7

2015-02-23 Thread Nicholas Chammas
The first concern for Spark will probably be to ensure that we still build
and test against Python 2.6, since that's the minimum version of Python we
support.

Otherwise this seems OK. We use numpy and other Python packages in PySpark,
but I don't think we're pinned to any particular version of those packages.

Nick

On Mon Feb 23 2015 at 2:15:19 PM shane knapp skn...@berkeley.edu wrote:

 good morning, developers!

 TL;DR:

 i will be installing anaconda and setting it in the system PATH so that
 your python will default to 2.7, as well as it taking over management of
 all of the sci-py packages.  this is potentially a big change, so i'll be
 testing locally on my staging instance before deployment to the wide world.

 deployment is *tentatively* next monday, march 2nd.

 a little background:

 the jenkins test infra is currently (and happily) managed by a set of tools
 that allow me to set up and deploy new workers, manage their packages and
 make sure that all spark and research projects can happily and successfully
 build.

 we're currently at the state where ~50 or so packages are installed and
 configured on each worker.  this is getting a little cumbersome, as the
 package-to-build dep tree is getting pretty large.

 the biggest offender is the science-based python infrastructure.
  everything is blindly installed w/yum and pip, so it's hard to control
 *exactly* what version of any given library is installed compared to what's on a
 dev's laptop.

 the solution:

 anaconda (https://store.continuum.io/cshop/anaconda/)!  everything is
 centralized!  i can manage specific versions much easier!

 what this means to you:

 * python 2.7 will be the default system python.
 * 2.6 will still be installed and available (/usr/bin/python or
 /usr/bin/python2.6)

 what you need to do:
 * install anaconda, have it update your PATH
 * build locally and try to fix any bugs (for spark, this should just
 work)
 * if you have problems, reach out to me and i'll see what i can do to help.
  if we can't get your stuff running under python2.7, we can default to 2.6
 via a job config change.

 what i will be doing:
 * setting up anaconda on my staging instance and spot-testing a lot of
 builds before deployment

 please let me know if there are any issues/concerns...  i'll be posting
 updates this week and will let everyone know if there are any changes to
 the Plan[tm].

 your friendly devops engineer,

 shane



[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334352#comment-14334352
 ] 

Nicholas Chammas commented on SPARK-4123:
-

Go ahead! I haven't done anything for this yet.

 Show new dependencies added in pull requests
 

 Key: SPARK-4123
 URL: https://issues.apache.org/jira/browse/SPARK-4123
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Priority: Critical

 We should inspect the classpath of Spark's assembly jar for every pull 
 request. This only takes a few seconds in Maven and it will help weed out 
 dependency changes from the master branch. Ideally we'd post any dependency 
 changes in the pull request message.
 {code}
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v "INFO" | tr ":" "\n" | awk -F/ '{print $NF}' | sort > my-classpath
 $ git checkout apache/master
 $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v "INFO" | tr ":" "\n" | awk -F/ '{print $NF}' | sort > master-classpath
 $ diff my-classpath master-classpath
 < chill-java-0.3.6.jar
 < chill_2.10-0.3.6.jar
 ---
 > chill-java-0.5.0.jar
 > chill_2.10-0.5.0.jar
 {code}
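
A small post-processing step could then turn that classpath diff into the summary posted on the pull request; here's a rough sketch (file names and output format are assumptions, not the actual Jenkins integration):

{code}
# Sketch only: summarize jar-level differences between the two classpath
# listings produced by the mvn commands above.
def classpath_diff(pr_file="my-classpath", master_file="master-classpath"):
    with open(pr_file) as f:
        pr_jars = set(f.read().split())
    with open(master_file) as f:
        master_jars = set(f.read().split())
    return sorted(pr_jars - master_jars), sorted(master_jars - pr_jars)

if __name__ == "__main__":
    added, removed = classpath_diff()
    for jar in added:
        print("Added:   " + jar)
    for jar in removed:
        print("Removed: " + jar)
{code}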



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3850) Scala style: disallow trailing spaces

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334351#comment-14334351
 ] 

Nicholas Chammas edited comment on SPARK-3850 at 2/24/15 5:16 AM:
--

{quote}
enabled=false
{quote}

Per the parent issue SPARK-3849, I believe this issue is about enabling this 
rule in a non-intrusive way. So I think we still need this issue.


was (Author: nchammas):
{quote}
enabled=false
{quote}

Per the parent issue SPARK-3849, I believe this issue about enabling this rule 
in a non-intrusive way. So I think we still need this issue.

 Scala style: disallow trailing spaces
 -

 Key: SPARK-3850
 URL: https://issues.apache.org/jira/browse/SPARK-3850
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Nicholas Chammas

 [Ted Yu on the dev 
 list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
  suggested using {{WhitespaceEndOfLineChecker}} here: 
 http://www.scalastyle.org/rules-0.1.0.html
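
To give a concrete sense of what enabling this non-intrusively could look like, here is a rough sketch that flags trailing whitespace only on lines added in a patch's diff (an illustration of the approach only; the real check would presumably stay in Scalastyle, gated by the machinery discussed in SPARK-3849, and the git invocation here is an assumption):

{code}
import re
import subprocess

def trailing_whitespace_in_diff(base="origin/master"):
    """Sketch: report files whose *added* lines end in whitespace.

    Scoping the check to the diff is what lets a new style rule come
    online without first fixing every existing violation in the repo.
    """
    diff = subprocess.check_output(
        ["git", "diff", "--unified=0", base]).decode("utf-8", "replace")
    offenders = []
    current_file = None
    for line in diff.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        elif line.startswith("+") and not line.startswith("+++"):
            if re.search(r"[ \t]+$", line):
                offenders.append(current_file)
    return sorted(set(offenders))
{code}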



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-02-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334355#comment-14334355
 ] 

Nicholas Chammas commented on SPARK-5312:
-

Yeah, this is not a priority really. I looked into sbt and agree it's probably 
not suited to the task. I found something else that looks interesting: 
http://software.clapper.org/classutil/

But I don't have time to look into it.

 Use sbt to detect new or changed public classes in PRs
 --

 Key: SPARK-5312
 URL: https://issues.apache.org/jira/browse/SPARK-5312
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Nicholas Chammas
Priority: Minor

 We currently use an [unwieldy grep/sed 
 contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
  to detect new public classes in PRs.
 Apparently, sbt lets you get a list of public classes [much more 
 directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
 {{show compile:discoveredMainClasses}}. We should use that instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4958) Bake common tools like ganglia into Spark AMI

2015-02-22 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-4958.
-
   Resolution: Duplicate
Fix Version/s: (was: 1.3.0)

Closing this as a duplicate of SPARK-3821 since we're covering the addition of 
stuff like Ganglia to the AMIs in that issue.

 Bake common tools like ganglia into Spark AMI
 -

 Key: SPARK-4958
 URL: https://issues.apache.org/jira/browse/SPARK-4958
 Project: Spark
  Issue Type: Sub-task
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Improving metadata in Spark JIRA

2015-02-22 Thread Nicholas Chammas
Open pull request count is down to 254 right now from ~325 several weeks
ago.

This is great. Ideally, we need to get this down to < 50 and keep it there.
Having so many open pull requests is just a bad signal to contributors. But
it will take some time to get there.


   - 1+ Component

 Sean, do you have permission to edit our JIRA settings? It should be
possible to enforce this in JIRA itself.


   - 1+ Affects version

 I don’t think this field makes sense for improvements, right?

Nick
​

On Sun Feb 22 2015 at 9:43:24 AM Sean Owen so...@cloudera.com wrote:

 Open pull request count is down to 254 right now from ~325 several weeks
 ago.
 Open JIRA count is down slightly to 1262 from a peak over ~1320.
 Obviously, in the face of an ever faster and larger stream of
 contributions.

 There's a real positive impact of JIRA being a little more meaningful, a
 little less backlog to keep looking at, getting commits in slightly faster,
 slightly happier contributors, etc.


 The virtuous circle can keep going. It'd be great if every contributor
 could take a moment to look at his or her open PRs and JIRAs. Example
 searches (replace with your user name / name):

 https://github.com/apache/spark/pulls/srowen
  https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20reporter%20%3D%20%22Sean%20Owen%22%20or%20assignee%20%3D%20%22Sean%20Owen%22

 For PRs:

 - if it appears to be waiting on your action or feedback,
   - push more changes and/or reply to comments, or
   - if it isn't work you can pursue in the immediate future, close the PR

 - if it appears to be waiting on others,
   - if it's had feedback and it's unclear whether there's support to commit
 as-is,
 - break down or reduce the change to something less controversial
 - close the PR as softly rejected
   - if there's no feedback or plainly waiting for action, ping @them

 For JIRAs:

 - If it's fixed along the way, or obsolete, resolve as Fixed or NotAProblem

 - Do a quick search to see if a similar issue has been filed and is
 resolved or has more activity; resolve as Duplicate if so

 - Check that fields are assigned reasonably:
   - Meaningful title and description
   - Reasonable type and priority. Not everything is a major bug, and few
 are blockers
   - 1+ Component
   - 1+ Affects version
   - Avoid setting target version until it looks like there's momentum to
 merge a resolution

 - If the JIRA has had no activity in a long time (6+ months), but does not
 feel obsolete, try to move it to some resolution:
   - Request feedback, from specific people if desired, to feel out if there
 is any other support for the change
   - Add more info, like a specific reproduction for bugs
   - Narrow scope of feature requests to something that contains a few
 actionable steps, instead of broad open-ended wishes
   - Work on a fix. In an ideal world people are willing to work to resolve
 JIRAs they open, and don't fire-and-forget


 If everyone did this, not only would it advance the house-cleaning a bit
 more, but I'm sure we'd rediscover some important work and issues that need
 attention.


 On Sun, Feb 22, 2015 at 7:54 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  As of right now, there are no more open JIRA issues without an assigned
  component
   https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20EMPTY%20ORDER%20BY%20updated%20DESC!
  Hurray!
 
  [image: yay]
 
  Thanks to Sean and others for the cleanup!
 
  Nick
 
  ​
 



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-02-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332303#comment-14332303
 ] 

Nicholas Chammas commented on SPARK-3821:
-

For those wanting to use the work being done as part of this issue before it 
gets merged upstream, I posted some [instructions on Stack 
Overflow|http://stackoverflow.com/a/28639669/877069] in response to a related 
question.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Git Achievements

2015-02-22 Thread Nicholas Chammas
For fun:

http://acha-acha.co/#/repo/https://github.com/apache/spark

I just added Spark to this site. Some of these “achievements” are hilarious.

Leo Tolstoy: More than 10 lines in a commit message

Dangerous Game: Commit after 6PM friday

Nick
​


[jira] [Commented] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332394#comment-14332394
 ] 

Nicholas Chammas commented on SPARK-5944:
-

cc [~davies], [~joshrosen]

 Python release docs say SNAPSHOT + Author is missing
 

 Key: SPARK-5944
 URL: https://issues.apache.org/jira/browse/SPARK-5944
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.1
Reporter: Nicholas Chammas
Priority: Minor

 http://spark.apache.org/docs/latest/api/python/index.html
 As of Feb 2015, that link says "PySpark 1.2-SNAPSHOT". It should probably say 
 1.2.1.
 Furthermore, in the footer it says "Copyright 2014, Author". It should 
 probably say something else or be removed altogether.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing

2015-02-22 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-5944:
---

 Summary: Python release docs say SNAPSHOT + Author is missing
 Key: SPARK-5944
 URL: https://issues.apache.org/jira/browse/SPARK-5944
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.2.1
Reporter: Nicholas Chammas
Priority: Minor


http://spark.apache.org/docs/latest/api/python/index.html

As of Feb 2015, that link says "PySpark 1.2-SNAPSHOT". It should probably say 
1.2.1.

Furthermore, in the footer it says "Copyright 2014, Author". It should probably 
say something else or be removed altogether.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-765) Test suite should run Spark example programs

2015-02-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332438#comment-14332438
 ] 

Nicholas Chammas commented on SPARK-765:


Seems like a good idea. [~joshrosen] I assume this is still to be done, right?

 Test suite should run Spark example programs
 

 Key: SPARK-765
 URL: https://issues.apache.org/jira/browse/SPARK-765
 Project: Spark
  Issue Type: New Feature
  Components: Examples
Reporter: Josh Rosen

 The Spark test suite should also run each of the Spark example programs (the 
 PySpark suite should do the same).  This should be done through a shell 
 script or other mechanism to simulate the environment setup used by end users 
 that run those scripts.
 This would prevent problems like SPARK-764 from making it into releases.
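
For the PySpark side, the harness could be as simple as looping over the example scripts with spark-submit; a rough sketch follows (paths and flags are assumptions, and a real harness would need per-example arguments and timeouts):

{code}
import glob
import os
import subprocess

def run_pyspark_examples(spark_home):
    """Sketch: run each PySpark example through spark-submit and collect
    the ones that exit non-zero. Directory layout is an assumption."""
    examples = sorted(glob.glob(
        os.path.join(spark_home, "examples", "src", "main", "python", "*.py")))
    failures = []
    for script in examples:
        rc = subprocess.call([os.path.join(spark_home, "bin", "spark-submit"),
                              "--master", "local[2]", script])
        if rc != 0:
            failures.append(script)
    return failures
{code}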



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


