[jira] [Commented] (SPARK-3849) Automate remaining Spark Code Style Guide rules
[ https://issues.apache.org/jira/browse/SPARK-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380603#comment-14380603 ] Nicholas Chammas commented on SPARK-3849:

Sounds good. My quick summary (which does not replace the due diligence just discussed) is that we need a way to enable new style rules (Scala at first, but maybe Python/R/Java too) on the whole repo. However, we don't want a new rule coming online to require fixing all outstanding problems at once. Rather, we want the rule to check the whole repo but fail the patch (via Jenkins) only if code touched in a given patch (i.e. from the git diff) fails some style rule. This will be impossible in cases where rule failures aren't tied to specific lines. But when they are (e.g. line too long), we want to line them up against the git diff line numbers. If there's overlap, fail the style check for that patch and point out the failing rule and line numbers. This way the repo can come into compliance with new style rules incrementally, rather than all at once with a single, large, and painful patch.

Automate remaining Spark Code Style Guide rules
---
Key: SPARK-3849
URL: https://issues.apache.org/jira/browse/SPARK-3849
Project: Spark
Issue Type: Improvement
Components: Project Infra
Reporter: Nicholas Chammas

Style problems continue to take up a large amount of review time, mostly because there are many [Spark Code Style Guide|https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide] rules that have not been automated. This issue tracks the remaining rules that have not been automated. To minimize the impact of introducing new rules that would otherwise require sweeping changes across the code base, we should look to *have new rules apply only to new code where possible*.
See [this dev list discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/Scalastyle-improvements-large-code-reformatting-td8755.html] for more background on this topic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
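The diff-overlap check described in the comment can be sketched as follows. This is an illustrative sketch, not any actual Spark build script: the helper names and the `(file, line, rule)` violation tuple shape are assumptions.

```python
import re

def changed_lines(diff_text):
    """Map file -> set of new-side line numbers touched by a unified diff."""
    changed = {}
    current = None
    new_line = 0
    for line in diff_text.splitlines():
        if line.startswith('+++ b/'):
            current = line[6:]
            changed.setdefault(current, set())
        elif line.startswith('@@'):
            # Hunk header like "@@ -10,3 +42,4 @@" gives the new-side start line.
            new_line = int(re.match(r'@@ -\d+(?:,\d+)? \+(\d+)', line).group(1))
        elif current is not None:
            if line.startswith('+'):
                changed[current].add(new_line)
                new_line += 1
            elif not line.startswith('-'):
                new_line += 1   # context lines advance the new-side counter
    return changed

def violations_in_diff(violations, diff_text):
    """Keep only (file, line, rule) style violations whose line overlaps the diff."""
    touched = changed_lines(diff_text)
    return [v for v in violations if v[1] in touched.get(v[0], set())]
```

Jenkins would then fail the patch only when `violations_in_diff` returns a non-empty list, reporting those rules and line numbers.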
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380464#comment-14380464 ] Nicholas Chammas commented on SPARK-6481:

The Spark user can initiate state transitions, but the issue needs to be assigned to it in order to do so. So here's what I'm going to do, after chatting briefly with Patrick:
* Save the assigned user, if any
* Assign to the Spark user
* Mark as In Progress ONLY IF the issue is Open
** I don't know if we want to change the issue state if it doesn't start out as Open. Let me know if you disagree.
* Restore the original assignee, including Unassigned if that's what it was.

Sound good to everybody? I'm going to implement this in the [jira_api.py|https://github.com/databricks/spark-pr-dashboard/blob/master/sparkprs/jira_api.py] that Josh pointed me to.

Set In Progress when a PR is opened for an issue
--
Key: SPARK-6481
URL: https://issues.apache.org/jira/browse/SPARK-6481
Project: Spark
Issue Type: Bug
Components: Project Infra
Reporter: Michael Armbrust

[~pwendell] and I are not sure if this is possible, but it would be really helpful if the JIRA status was updated to In Progress when we do the linking to an open pull request.
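The save/assign/transition/restore sequence above can be sketched as a small helper. The client interface and the bot account name here are stand-ins (any object with assignee/status accessors), not the actual jira_api.py or spark-pr-dashboard API:

```python
SPARK_BOT = "apachespark"  # hypothetical service-account name

def mark_in_progress(client, issue_key):
    """Transition an issue to In Progress via a bot account,
    preserving the original assignee (including Unassigned)."""
    original = client.get_assignee(issue_key)   # may be None (Unassigned)
    if client.get_status(issue_key) != "Open":
        return False                            # only transition Open issues
    client.assign(issue_key, SPARK_BOT)         # bot needs the assignment to transition
    client.transition(issue_key, "In Progress")
    client.assign(issue_key, original)          # restore, even if None
    return True
```

The restore step runs unconditionally after the transition, so a previously Unassigned issue ends up Unassigned again rather than stuck on the bot account.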
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380585#comment-14380585 ] Nicholas Chammas commented on SPARK-6481:

PR for this: https://github.com/databricks/spark-pr-dashboard/pull/49
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378838#comment-14378838 ] Nicholas Chammas commented on SPARK-6481:

Since there is no guaranteed way to map GitHub usernames to JIRA usernames, what should we do about the JIRA assignee? A JIRA issue needs an assignee in order to be marked In Progress. We can have the script:
# always assign the issue to the Apache Spark user
# keep it assigned to whoever has it assigned, if any (this may be different from the PR user)
# in the case of no current assignee, assign to Apache Spark just to mark the JIRA In Progress, then remove the assignee

Any preferences [~marmbrus] / [~pwendell]?

Set In Progress when a PR is opened for an issue
--
Key: SPARK-6481
URL: https://issues.apache.org/jira/browse/SPARK-6481
Project: Spark
Issue Type: Bug
Components: Project Infra
Reporter: Michael Armbrust
Assignee: Nicholas Chammas

[~pwendell] and I are not sure if this is possible, but it would be really helpful if the JIRA status was updated to In Progress when we do the linking to an open pull request.
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378114#comment-14378114 ] Nicholas Chammas commented on SPARK-6481:

[~pwendell] - Where is the GitHub JIRA sync script triggered from? I want to see how it's invoked, as well as get some way to run the script on demand for testing.
[jira] [Comment Edited] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378393#comment-14378393 ] Nicholas Chammas edited comment on SPARK-6481 at 3/24/15 7:07 PM:

Ah, thanks for the pointers. So should that script be removed from the Spark repo? Also, how would I go about testing changes to {{jira_api.py}} without getting credentials?

was (Author: nchammas): So should that script be removed from the Spark repo? Also, how would I go about testing changes to {{jira_api.py}} without getting credentials?
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378393#comment-14378393 ] Nicholas Chammas commented on SPARK-6481:

So should that script be removed from the Spark repo? Also, how would I go about testing changes to {{jira_api.py}} without getting credentials?
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378436#comment-14378436 ] Nicholas Chammas commented on SPARK-6481:

The change Michael/Patrick want is for state transitions, and AFAICT I don't have permission to do that with my personal JIRA account. If my personal account is given the appropriate permissions (need to trigger state transitions; need to view project workflow), then certainly I can test things out using my personal credentials.
[jira] [Commented] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375929#comment-14375929 ] Nicholas Chammas commented on SPARK-2394:

Thank you for posting this information for others!

Make it easier to read LZO-compressed files from EC2 clusters
-
Key: SPARK-2394
URL: https://issues.apache.org/jira/browse/SPARK-2394
Project: Spark
Issue Type: Improvement
Components: EC2, Input/Output
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor
Labels: compression

Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, among other things, for putting together interesting and easily reproducible public demos of Spark's capabilities. The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average {{spark-ec2}} cluster to read input compressed in this way. This is what one has to go through to get a Spark cluster created with {{spark-ec2}} to read LZO-compressed files:
# Install the latest LZO release, perhaps via {{yum}}.
# Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven.
# Install Maven. For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually.
# Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
# Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}.
This seems like a bit too much work for what we're trying to accomplish.
If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}} cluster -- it would be great if we could somehow make this less painful.
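Step 4 above mentions {{core-site.xml}} updates. The snippet below is the commonly used hadoop-lzo configuration rather than the exact contents of the linked thread; the codec class names come from the twitter/hadoop-lzo project, so verify them against the jar you build:

```xml
<!-- core-site.xml additions for hadoop-lzo
     (class names assume the twitter/hadoop-lzo build) -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

{{spark-env.sh}} then needs the hadoop-lzo jar and native libraries on the classpath and library path, per the linked thread.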
[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6474: Issue Type: Improvement (was: Bug)

Replace image.run with connection.run_instances in spark_ec2.py
---
Key: SPARK-6474
URL: https://issues.apache.org/jira/browse/SPARK-6474
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Andrew Drozdov
Priority: Minor

After looking at an issue in Boto [1], ec2.image.Image.run and ec2.connection.EC2Connection.run_instances are similar calls, but run_instances appears to have more features and is more up to date. For example, run_instances has the capability to launch ebs_optimized instances while run does not. The run call is being used in only a couple places in spark_ec2.py, so let's replace it with run_instances. [1] https://github.com/boto/boto/issues/3054
[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376584#comment-14376584 ] Nicholas Chammas commented on SPARK-6474:

This change also fits the pattern of [{{request_spot_instances()}}|https://github.com/apache/spark/blob/474d1320c9b93c501710ad1cfa836b8284562a2c/ec2/spark_ec2.py#L542], which is called on the connection like {{run_instances()}} as opposed to on an {{Image}}.
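As a sketch of the switch, the keyword names below follow boto 2's {{EC2Connection.run_instances}} signature; the {{opts}} dictionary and helper are illustrative, not spark_ec2.py's actual option handling:

```python
def run_instances_kwargs(ami_id, num, opts):
    """Build the keyword arguments for EC2Connection.run_instances();
    ebs_optimized is the option that image.run() cannot express."""
    return dict(
        image_id=ami_id,
        min_count=num,
        max_count=num,
        key_name=opts['key_pair'],
        security_group_ids=opts['group_ids'],
        instance_type=opts['instance_type'],
        ebs_optimized=opts.get('ebs_optimized', False),
    )

# At the call site (needs live AWS credentials, so not run here):
# reservation = conn.run_instances(**run_instances_kwargs(ami, num_slaves, opts))
```

Calling on the connection also matches how {{request_spot_instances()}} is already invoked in spark_ec2.py.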
[jira] [Updated] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6474: Priority: Minor (was: Major)
[jira] [Comment Edited] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376572#comment-14376572 ] Nicholas Chammas edited comment on SPARK-6474 at 3/23/15 8:29 PM:

LGTM. Just setting the Priority to Minor since this doesn't cause any major problems, though it should be fixed.

was (Author: nchammas): LGTM.
[jira] [Commented] (SPARK-6474) Replace image.run with connection.run_instances in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376572#comment-14376572 ] Nicholas Chammas commented on SPARK-6474:

LGTM.
[jira] [Commented] (SPARK-6481) Set In Progress when a PR is opened for an issue
[ https://issues.apache.org/jira/browse/SPARK-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377034#comment-14377034 ] Nicholas Chammas commented on SPARK-6481:

I'm guessing this will be done via [github_jira_sync.py|https://github.com/apache/spark/blob/master/dev/github_jira_sync.py]. OK, will take a look this week.
[issue21423] concurrent.futures.ThreadPoolExecutor/ProcessPoolExecutor should accept an initializer argument
Changes by Nicholas Chammas nicholas.cham...@gmail.com: -- nosy: +Nicholas Chammas ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue21423 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
Re: Apache Spark User List: people's responses not showing in the browser view
Nabble is a third-party site that tries its best to archive mail sent out over the list. Nothing guarantees it will be in sync with the real mailing list. To get the truth on what was sent over this Apache-managed list, you unfortunately need to go to the Apache archives: http://mail-archives.apache.org/mod_mbox/spark-user/ Nick On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote: There might be some delay: http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Ted. Well, so far even there I'm only seeing my post and not, for example, your response. On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote: Was this one of the threads you participated ? http://search-hadoop.com/m/JW1q5w0p8x1 You should be able to find your posts on search-hadoop.com On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Sorry if this is a total noob question but is there a reason why I'm only seeing folks' responses to my posts in emails but not in the browser view under apache-spark-user-list.1001560.n3.nabble.com? Is this a matter of setting your preferences such that your responses only go to email and never to the browser-based view of the list? I don't seem to see such a preference... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Apache Spark User List: people's responses not showing in the browser view
Sure, you can use Nabble or search-hadoop or whatever you prefer. My point is just that the source of truth is the Apache archives, and these other sites may or may not be in sync with that truth. On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote: I prefer using search-hadoop.com which provides better search capability. Cheers
Re: Apache Spark User List: people's responses not showing in the browser view
Yes, that is mostly why these third-party sites have sprung up around the official archives--to provide better search. Did you try the link Ted posted? On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg dgoldenberg...@gmail.com wrote: It seems that those archives are not necessarily easy to find stuff in. Is there a search engine on top of them? so as to find e.g. your own posts easily?
Re: Processing of text file in large gzip archive
You probably want to update this line as follows:

lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)

For more details on why, see this answer: http://stackoverflow.com/a/27631722/877069

Nick

On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote:
> 1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoopFile for this.
Sorry, that's incorrect: textFile works fine on .gz files. What it can't do is compute splits on gz files, so if you have a single file, you'll have a single partition. Processing 30 GB of gzipped data should not take that long, at least with the Scala API. With Python I'm not sure, especially under 1.2.1.
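The reason for the repartition: a gzip file is not splittable, so sc.textFile gives it a single partition, and repartitioning after the read spreads the decompressed lines across the cluster. A minimal sketch, where the multiplier of 3 is just the rule of thumb from the suggested line and target_partitions is a hypothetical helper:

```python
# A .gz file is not splittable, so sc.textFile() reads it into a single
# partition; repartition() then redistributes the decompressed lines.
PARTITION_FACTOR = 3  # rule of thumb: a few partitions per available core

def target_partitions(default_parallelism, factor=PARTITION_FACTOR):
    """Pick a partition count from the cluster's default parallelism."""
    return max(default_parallelism * factor, 1)

# Usage in a PySpark job (file name is illustrative):
# lines = sc.textFile('file.gz').repartition(
#     target_partitions(sc.defaultParallelism))
```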
[jira] [Updated] (SPARK-6342) Leverage cfncluster in spark_ec2
[ https://issues.apache.org/jira/browse/SPARK-6342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6342: Component/s: EC2

Leverage cfncluster in spark_ec2
-
Key: SPARK-6342
URL: https://issues.apache.org/jira/browse/SPARK-6342
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Alex Rothberg
Priority: Minor

Consider taking advantage of cfncluster (http://cfncluster.readthedocs.org/en/latest/) in the spark_ec2 script.
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360534#comment-14360534 ] Nicholas Chammas commented on SPARK-6282:

[~joshrosen], [~davies]: Does this error look familiar to you?

Strange Python import error when using random() in a lambda function
Key: SPARK-6282
URL: https://issues.apache.org/jira/browse/SPARK-6282
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.0
Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor

Consider the exemplary Python code below:

    from random import random
    from pyspark.context import SparkContext
    from xval_mllib import read_csv_file_as_list

    if __name__ == "__main__":
        sc = SparkContext(appName="Random() bug test")
        data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
        # data = sc.parallelize([1, 2, 3, 4, 5], 2)
        d = data.map(lambda x: (random(), x))
        print d.first()

Data is read from a large CSV file. Running this code results in a Python import error: ImportError: No module named _winreg. If I use 'import random' and 'random.random()' in the lambda function no error occurs. Also no error occurs, for both kinds of import statements, for a small artificial data set like the one shown in the commented line. The full error trace, the source code of the CSV reading code (the function 'read_csv_file_as_list' is my own) as well as a sample dataset (the original dataset is about 8M large) can be provided.
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359404#comment-14359404 ] Nicholas Chammas commented on SPARK-6282:

Shouldn't be related to boto. _winreg appears to be something Python uses to access the Windows registry, which is strange. Please give us more details about your cluster setup, where you are running the driver from, etc. Also, what if you try using numpy's implementation of {{random}}?
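The workaround the reporter mentions (use 'import random' and call 'random.random()' inside the lambda, rather than importing the bare function) can be illustrated without Spark; in the PySpark job the same lambda would be passed to data.map:

```python
import random

# Workaround from the report: reference the module attribute, not the
# bare imported function, inside the lambda that gets shipped to workers.
make_keyed = lambda x: (random.random(), x)

# Stand-in for data.map(make_keyed) on a tiny local dataset:
pairs = [make_keyed(x) for x in [1, 2, 3]]
```

The distinction matters because the lambda is serialized and re-executed on the workers, so how the name 'random' is resolved there differs between the two import styles.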
[jira] [Updated] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5189: Description: As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach: # Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. I did some testing in {{us-east-1}}. This is, concretely, what the problem looks like:
|| number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough. # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself. Logically, the operations we want to implement are: * Provision a Spark node * Join a node to a cluster (including an empty cluster) as either a master or a slave * Remove a node from a cluster We need our scripts to roughly be organized to match the above operations. The goals would be: # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck. # Facilitate cluster modifications like adding or removing nodes. 
# Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw]. More concretely, the modifications we need to make are: * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations. * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave. * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster. * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script. was: As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach: # Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself. 
Logically, the operations we want to implement are: * Provision a Spark node * Join a node to a cluster (including an empty cluster) as either a master or a slave * Remove a node from a cluster We need our scripts to roughly be organized to match the above operations. The goals would be: # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck. # Facilitate cluster modifications like adding or removing nodes. # Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw]. More concretely, the modifications we need to make are: * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations. * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave. * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster. * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script.
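The first goal above, provisioning all nodes in parallel instead of broadcasting from the master, can be sketched along these lines (a hypothetical illustration; {{provision_node}} stands in for whatever per-node setup the proposed {{provision-spark-node.sh}} would do):

```python
from concurrent.futures import ThreadPoolExecutor

def provision_node(host):
    # Stand-in for the per-node setup work (downloads, config, daemons),
    # run on the node itself rather than pushed out from the master.
    return "{}: provisioned".format(host)

hosts = ["master", "slave-1", "slave-2", "slave-3"]

# Provision every node concurrently: total wall time is bounded by the
# slowest single node, not by a serial master-to-slave broadcast.
with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
    results = list(pool.map(provision_node, hosts))

print(results)
```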
[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359665#comment-14359665 ] Nicholas Chammas commented on SPARK-5189: - For the record, this is the script I used to get the launch time stats above:
{code}
{
python -m timeit -r 6 -n 1 \
    --setup 'import subprocess; import time; subprocess.call("yes y | ./ec2/spark-ec2 destroy launch-test --identity-file /path/to/file.pem --key-pair my-pair --region us-east-1", shell=True); time.sleep(60)' \
    'subprocess.call("./ec2/spark-ec2 launch launch-test --slaves 99 --identity-file /path/to/file.pem --key-pair my-pair --region us-east-1 --zone us-east-1c --instance-type m3.large", shell=True)'
yes y | ./ec2/spark-ec2 destroy launch-test --identity-file /path/to/file.pem --key-pair my-pair --region us-east-1
}
{code}
Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master --- Key: SPARK-5189 URL: https://issues.apache.org/jira/browse/SPARK-5189 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach: # Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. I did some testing in {{us-east-1}}. 
This is, concretely, what the problem looks like:
|| number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough. # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself. Logically, the operations we want to implement are: * Provision a Spark node * Join a node to a cluster (including an empty cluster) as either a master or a slave * Remove a node from a cluster We need our scripts to roughly be organized to match the above operations. The goals would be: # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck. # Facilitate cluster modifications like adding or removing nodes. # Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw]. More concretely, the modifications we need to make are: * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations. * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave. * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster. * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times
[ https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354956#comment-14354956 ] Nicholas Chammas commented on SPARK-4325: - At this point it's more an umbrella task containing any issues that impact spark-ec2 cluster launch times. Dunno if that's appropriate, but I've seen other issues structured like this. I'm fine with closing this issue, but it's what I'm using to group issues related to the same problem. Improve spark-ec2 cluster launch times -- Key: SPARK-4325 URL: https://issues.apache.org/jira/browse/SPARK-4325 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.3.0 There are several optimizations we know we can make to [{{setup.sh}} | https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches faster. There are also some improvements to the AMIs that will help a lot. Potential improvements: * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will reduce or eliminate SSH wait time and Ganglia init time. * Replace instances of {{download; rsync to rest of cluster}} with parallel downloads on all nodes of the cluster. * Replace instances of
{code}
for node in $NODES; do
  command
  sleep 0.3
done
wait
{code}
with simpler calls to {{pssh}}. * Remove the [linear backoff | https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] when we wait for SSH availability now that we are already waiting for EC2 status checks to clear before testing SSH. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4325) Improve spark-ec2 cluster launch times
[ https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354939#comment-14354939 ] Nicholas Chammas commented on SPARK-4325: - [~srowen] - I should perhaps change the linked issues to "contains", since SPARK-5189 and SPARK-3821 are where the actual launch time improvements are. The subtasks here (one of which was just resolved as a dup of SPARK-3821) are relatively insignificant. Improve spark-ec2 cluster launch times -- Key: SPARK-4325 URL: https://issues.apache.org/jira/browse/SPARK-4325 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.3.0 There are several optimizations we know we can make to [{{setup.sh}} | https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches faster. There are also some improvements to the AMIs that will help a lot. Potential improvements: * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will reduce or eliminate SSH wait time and Ganglia init time. * Replace instances of {{download; rsync to rest of cluster}} with parallel downloads on all nodes of the cluster. * Replace instances of
{code}
for node in $NODES; do
  command
  sleep 0.3
done
wait
{code}
with simpler calls to {{pssh}}. * Remove the [linear backoff | https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] when we wait for SSH availability now that we are already waiting for EC2 status checks to clear before testing SSH. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6246) spark-ec2 can't handle clusters with 100 nodes
Nicholas Chammas created SPARK-6246: --- Summary: spark-ec2 can't handle clusters with 100 nodes Key: SPARK-6246 URL: https://issues.apache.org/jira/browse/SPARK-6246 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.3.0 Reporter: Nicholas Chammas Priority: Minor This appears to be a new restriction, perhaps resulting from our upgrade of boto. Maybe it's a new restriction from EC2. Not sure yet. We didn't have this issue around the Spark 1.1.0 time frame from what I can remember. I'll track down where the issue is and when it started. Attempting to launch a cluster with 100 slaves yields the following:
{code}
Spark AMI: ami-35b1885c
Launching instances...
Launched 100 slaves in us-east-1c, regid = r-9c408776
Launched master in us-east-1c, regid = r-92408778
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
Traceback (most recent call last):
  File "./ec2/spark_ec2.py", line 1338, in <module>
    main()
  File "./ec2/spark_ec2.py", line 1330, in main
    real_main()
  File "./ec2/spark_ec2.py", line 1170, in real_main
    cluster_state='ssh-ready'
  File "./ec2/spark_ec2.py", line 795, in wait_for_cluster_state
    statuses = conn.get_all_instance_status(instance_ids=[i.id for i in cluster_instances])
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 737, in get_all_instance_status
    InstanceStatusSet, verb='POST')
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1204, in get_object
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
{code}
This problem seems to be with {{get_all_instance_status()}}, though I am not sure if other methods are affected too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6246) spark-ec2 can't handle clusters with 100 nodes
[ https://issues.apache.org/jira/browse/SPARK-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354969#comment-14354969 ] Nicholas Chammas commented on SPARK-6246: - FYI [~shivaram]. spark-ec2 can't handle clusters with 100 nodes Key: SPARK-6246 URL: https://issues.apache.org/jira/browse/SPARK-6246 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.3.0 Reporter: Nicholas Chammas Priority: Minor This appears to be a new restriction, perhaps resulting from our upgrade of boto. Maybe it's a new restriction from EC2. Not sure yet. We didn't have this issue around the Spark 1.1.0 time frame from what I can remember. I'll track down where the issue is and when it started. Attempting to launch a cluster with 100 slaves yields the following:
{code}
Spark AMI: ami-35b1885c
Launching instances...
Launched 100 slaves in us-east-1c, regid = r-9c408776
Launched master in us-east-1c, regid = r-92408778
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
Traceback (most recent call last):
  File "./ec2/spark_ec2.py", line 1338, in <module>
    main()
  File "./ec2/spark_ec2.py", line 1330, in main
    real_main()
  File "./ec2/spark_ec2.py", line 1170, in real_main
    cluster_state='ssh-ready'
  File "./ec2/spark_ec2.py", line 795, in wait_for_cluster_state
    statuses = conn.get_all_instance_status(instance_ids=[i.id for i in cluster_instances])
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 737, in get_all_instance_status
    InstanceStatusSet, verb='POST')
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1204, in get_object
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
{code}
This problem seems to be with {{get_all_instance_status()}}, though I am not sure if other methods are affected too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4325) Improve spark-ec2 cluster launch times
[ https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas reopened SPARK-4325: - Reopening after updating "contains" issue links. Improve spark-ec2 cluster launch times -- Key: SPARK-4325 URL: https://issues.apache.org/jira/browse/SPARK-4325 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.3.0 There are several optimizations we know we can make to [{{setup.sh}} | https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches faster. There are also some improvements to the AMIs that will help a lot. Potential improvements: * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will reduce or eliminate SSH wait time and Ganglia init time. * Replace instances of {{download; rsync to rest of cluster}} with parallel downloads on all nodes of the cluster. * Replace instances of
{code}
for node in $NODES; do
  command
  sleep 0.3
done
wait
{code}
with simpler calls to {{pssh}}. * Remove the [linear backoff | https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] when we wait for SSH availability now that we are already waiting for EC2 status checks to clear before testing SSH. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354991#comment-14354991 ] Nicholas Chammas commented on SPARK-6220: - Another thought to add, there are options for running instances on dedicated hardware and securing provisioned IOPs that we are likely (well, I am likely) to use. Those could also grow into top-level options, making our option list really long. If we go with the original suggestion here and provide some generic way to pass those options through, perhaps it makes sense to invest in SPARK-925 at the same time so that users in most cases would just specify those options in a file and not have to fidget with very long command line parameters. A command-line equivalent for passing options through will still be needed of course, but it won't be as big of a deal if people have to type some kind of quasi-JSON or YAML since they have the config file as well. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. 
Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example:
{code}
spark-ec2 \
    ...
    --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
    --ec2-instance-option ebs_optimized=True
{code}
I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
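A rough sketch of how repeated {{--ec2-instance-option}} flags could be collected and turned into keyword arguments for the underlying boto call (the parsing scheme here is an assumption about the eventual design, not something decided in the issue):

```python
import ast
import optparse  # spark_ec2.py was optparse-based at the time

parser = optparse.OptionParser()
parser.add_option(
    "--ec2-instance-option", action="append", default=[],
    dest="ec2_instance_options", metavar="KEY=VALUE")

opts, _ = parser.parse_args([
    "--ec2-instance-option", "instance_initiated_shutdown_behavior=terminate",
    "--ec2-instance-option", "ebs_optimized=True",
])

def to_kwargs(pairs):
    # Turn repeated KEY=VALUE flags into a kwargs dict for boto.
    kwargs = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        try:
            value = ast.literal_eval(value)  # "True" -> True, "3" -> 3
        except (ValueError, SyntaxError):
            pass  # plain strings like "terminate" stay strings
        kwargs[key] = value
    return kwargs

kwargs = to_kwargs(opts.ec2_instance_options)
# These would then be splatted into the boto call,
# e.g. conn.run_instances(..., **kwargs)
print(kwargs)
```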
[jira] [Updated] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5312: Description: We currently use an [unwieldy grep/sed contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] to detect new public classes in PRs. -Apparently, sbt lets you get a list of public classes [much more directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via {{show compile:discoveredMainClasses}}. We should use that instead.- There is a tool called [ClassUtil|http://software.clapper.org/classutil/] that seems to help give this kind of information much more directly. We should look into using that. was: We currently use an [unwieldy grep/sed contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] to detect new public classes in PRs. Apparently, sbt lets you get a list of public classes [much more directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via {{show compile:discoveredMainClasses}}. We should use that instead. Use sbt to detect new or changed public classes in PRs -- Key: SPARK-5312 URL: https://issues.apache.org/jira/browse/SPARK-5312 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Nicholas Chammas Priority: Minor We currently use an [unwieldy grep/sed contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] to detect new public classes in PRs. -Apparently, sbt lets you get a list of public classes [much more directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via {{show compile:discoveredMainClasses}}. We should use that instead.- There is a tool called [ClassUtil|http://software.clapper.org/classutil/] that seems to help give this kind of information much more directly. We should look into using that. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355622#comment-14355622 ] Nicholas Chammas commented on SPARK-5312: - Thanks for looking into this [~boyork]. I'm looking forward to seeing what comes of it. The goal, as you hinted at, is basically to give reviewers a complement to the MIMA check that lets them see public API changes for each PR very easily. Use sbt to detect new or changed public classes in PRs -- Key: SPARK-5312 URL: https://issues.apache.org/jira/browse/SPARK-5312 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Nicholas Chammas Priority: Minor We currently use an [unwieldy grep/sed contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] to detect new public classes in PRs. Apparently, sbt lets you get a list of public classes [much more directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via {{show compile:discoveredMainClasses}}. We should use that instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6246) spark-ec2 can't handle clusters with 100 nodes
[ https://issues.apache.org/jira/browse/SPARK-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355642#comment-14355642 ] Nicholas Chammas commented on SPARK-6246: - I dunno, I haven't looked into the problem yet (been out all day), but I'm surprised that everything else works with 100 nodes: creating nodes, destroying them, getting them. It's just the status check call. If we have to, sure, I'll batch the calls. But I suspect there's a better way to do things. I'm surprised boto doesn't just abstract this problem away. Anyway, I'll look into it and report back. spark-ec2 can't handle clusters with 100 nodes Key: SPARK-6246 URL: https://issues.apache.org/jira/browse/SPARK-6246 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.3.0 Reporter: Nicholas Chammas Priority: Minor This appears to be a new restriction, perhaps resulting from our upgrade of boto. Maybe it's a new restriction from EC2. Not sure yet. We didn't have this issue around the Spark 1.1.0 time frame from what I can remember. I'll track down where the issue is and when it started. Attempting to launch a cluster with 100 slaves yields the following:
{code}
Spark AMI: ami-35b1885c
Launching instances...
Launched 100 slaves in us-east-1c, regid = r-9c408776
Launched master in us-east-1c, regid = r-92408778
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
Traceback (most recent call last):
  File "./ec2/spark_ec2.py", line 1338, in <module>
    main()
  File "./ec2/spark_ec2.py", line 1330, in main
    real_main()
  File "./ec2/spark_ec2.py", line 1170, in real_main
    cluster_state='ssh-ready'
  File "./ec2/spark_ec2.py", line 795, in wait_for_cluster_state
    statuses = conn.get_all_instance_status(instance_ids=[i.id for i in cluster_instances])
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py", line 737, in get_all_instance_status
    InstanceStatusSet, verb='POST')
  File "/path/apache/spark/ec2/lib/boto-2.34.0/boto/connection.py", line 1204, in get_object
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidRequest</Code><Message>101 exceeds the maximum number of instance IDs that can be specificied (100). Please specify fewer than 100 instance IDs.</Message></Error></Errors><RequestID>217fd6ff-9afa-4e91-86bc-ab16fcc442d8</RequestID></Response>
{code}
This problem seems to be with {{get_all_instance_status()}}, though I am not sure if other methods are affected too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
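Batching the status calls, as floated in the comment above, might look roughly like this (a sketch; {{get_all_instance_status}} is the real boto method, but the chunking wrapper is hypothetical):

```python
def chunk(seq, size=100):
    # EC2's DescribeInstanceStatus call rejects requests naming more
    # than 100 instance IDs, so split the ID list into compliant batches.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def get_all_statuses(conn, instance_ids):
    # Issue one get_all_instance_status() call per batch and merge results.
    statuses = []
    for batch in chunk(instance_ids):
        statuses.extend(conn.get_all_instance_status(instance_ids=batch))
    return statuses

# 101 IDs would be split into batches of 100 and 1:
print([len(b) for b in chunk(list(range(101)))])
```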
[jira] [Commented] (SPARK-5313) Create simple framework for highlighting changes introduced in a PR
[ https://issues.apache.org/jira/browse/SPARK-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355819#comment-14355819 ] Nicholas Chammas commented on SPARK-5313: - I had an idea to generalize the process of comparing any given property across {{master}} and a given PR and displaying the result on the PR. I'll update the issue links from contains to relates to, because that's all it is--an abstracted way for our Jenkins script to report on PR characteristics. Create simple framework for highlighting changes introduced in a PR --- Key: SPARK-5313 URL: https://issues.apache.org/jira/browse/SPARK-5313 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Nicholas Chammas Priority: Minor For any given PR, we may want to run a bunch of checks along the following lines: * Show property X of {{master}} * Show the same property X of this PR * Call out any differences on the GitHub page It might be helpful to write a simple function that takes any check -- itself represented as a function -- as input, runs the check on master and the PR, and outputs the diff. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
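The abstracted check described above could look something like this (purely illustrative; the check function, the branch refs, and the reporting format are all hypothetical):

```python
def diff_property(check, master_ref, pr_ref):
    # Run the same check against master and against the PR, and
    # report any difference (to be called out on the GitHub page).
    before = check(master_ref)
    after = check(pr_ref)
    if before == after:
        return None  # nothing worth reporting
    return {"master": before, "pr": after}

# Example check: "property X" is the set of public class names per ref.
fake_refs = {
    "master": {"SparkContext", "RDD"},
    "pr-123": {"SparkContext", "RDD", "NewPublicAPI"},
}

def public_classes(ref):
    return fake_refs[ref]

print(diff_property(public_classes, "master", "pr-123"))
```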
[jira] [Updated] (SPARK-4325) Improve spark-ec2 cluster launch times
[ https://issues.apache.org/jira/browse/SPARK-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-4325: Description: This is an umbrella task to capture several pieces of work related to significantly improving spark-ec2 cluster launch times. There are several optimizations we know we can make to [{{setup.sh}} | https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches faster. There are also some improvements to the AMIs that will help a lot. Potential improvements: * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will reduce or eliminate SSH wait time and Ganglia init time. * Replace instances of {{download; rsync to rest of cluster}} with parallel downloads on all nodes of the cluster. * Replace instances of
{code}
for node in $NODES; do
  command
  sleep 0.3
done
wait
{code}
with simpler calls to {{pssh}}. * Remove the [linear backoff | https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] when we wait for SSH availability now that we are already waiting for EC2 status checks to clear before testing SSH. was: There are several optimizations we know we can make to [{{setup.sh}} | https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches faster. There are also some improvements to the AMIs that will help a lot. Potential improvements: * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will reduce or eliminate SSH wait time and Ganglia init time. * Replace instances of {{download; rsync to rest of cluster}} with parallel downloads on all nodes of the cluster. * Replace instances of
{code}
for node in $NODES; do
  command
  sleep 0.3
done
wait
{code}
with simpler calls to {{pssh}}. 
* Remove the [linear backoff | https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] when we wait for SSH availability now that we are already waiting for EC2 status checks to clear before testing SSH. Improve spark-ec2 cluster launch times -- Key: SPARK-4325 URL: https://issues.apache.org/jira/browse/SPARK-4325 Project: Spark Issue Type: Umbrella Components: EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.3.0 This is an umbrella task to capture several pieces of work related to significantly improving spark-ec2 cluster launch times. There are several optimizations we know we can make to [{{setup.sh}} | https://github.com/mesos/spark-ec2/blob/v4/setup.sh] to make cluster launches faster. There are also some improvements to the AMIs that will help a lot. Potential improvements: * Upgrade the Spark AMIs and pre-install tools like Ganglia on them. This will reduce or eliminate SSH wait time and Ganglia init time. * Replace instances of {{download; rsync to rest of cluster}} with parallel downloads on all nodes of the cluster. * Replace instances of
{code}
for node in $NODES; do
  command
  sleep 0.3
done
wait
{code}
with simpler calls to {{pssh}}. * Remove the [linear backoff | https://github.com/apache/spark/blob/b32734e12d5197bad26c080e529edd875604c6fb/ec2/spark_ec2.py#L665] when we wait for SSH availability now that we are already waiting for EC2 status checks to clear before testing SSH.
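The loop-vs-pssh improvement above can be sketched as follows. Here {{echo}} stands in for the real per-node setup command (in {{setup.sh}} it would be an ssh call), and the {{pssh}} invocation is illustrative, since exact flags vary by pssh version.

```shell
#!/bin/sh
# Sketch of the optimization described above; the per-node command is a
# placeholder (echo), and the node list is a made-up example.
NODES="node1 node2 node3"

# Before: a serial loop with a 0.3s stagger between nodes.
for node in $NODES; do
    echo "setting up $node"
    sleep 0.3
done
wait

# After: a single parallel invocation, e.g. (illustrative, not executed here):
#   pssh -H "node1 node2 node3" setup_command
```

With pssh, the launch time for this step no longer grows linearly with the number of nodes.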
[jira] [Commented] (SPARK-6219) Expand Python lint checks to check for compilation errors
[ https://issues.apache.org/jira/browse/SPARK-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353325#comment-14353325 ] Nicholas Chammas commented on SPARK-6219: - That's a good point, I haven't checked to see what's already covered in that way by unit tests. At the very least, I can say that this will catch stuff in spark-ec2 and examples that unit tests currently do not cover. Also, it runs very, very quickly. Expand Python lint checks to check for compilation errors -- Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the stuff at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
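A compile-only check along the lines described above can be sketched with the standard library's {{py_compile}} module; the helper name is illustrative, not the actual lint script.

```python
# Sketch of a compile-only lint check: flag Python files with syntax errors
# without importing or running them. The helper name is illustrative.
import py_compile

def compiles_ok(path):
    """Return True if the file at `path` at least compiles."""
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False
```

Running this over every .py file in the repo only parses and byte-compiles the sources, which is why the check runs very quickly.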
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354217#comment-14354217 ] Nicholas Chammas commented on SPARK-6220: - I took another look at the 2 boto methods we'd be passing these options to. * [{{boto.ec2.image.Image.run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * [{{boto.ec2.connection.EC2Connection.request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] The parameter types they take are quite varied, from {{bool}} to {{string}} to {{list(string)}} to {{list(boto.ec2.networkinterface.NetworkInterfaceSpecification)}}. Covering them generically, even just a subset of them, would require us to take input that can be type cast somehow--maybe some kind of stripped-down JSON. I'm not sure we want to do that to spark-ec2. Maybe instead I should just add the options I need to support {{instance_profile_arn}} / {{instance_profile_name}} (for IAM support) and {{instance_initiated_shutdown_behavior}} (for self-terminating clusters) and call it a day. [~shivaram], [~joshrosen], [~pwendell]: What do y'all think? Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. 
Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example:
{code}
spark-ec2 \
    ...
    --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
    --ec2-instance-option ebs_optimized=True
{code}
I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly.
{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}
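One way the extended options could be wired up is sketched below; {{parse_extended_options}} and its casting rules (bool/int/str) are hypothetical illustrations, not spark-ec2's actual behavior.

```python
# Hypothetical sketch of turning repeated --ec2-instance-option key=value
# flags into keyword arguments for the underlying boto call. The casting
# rules here are assumptions, not spark-ec2 code.
def parse_extended_options(raw_options):
    """Turn ['key=value', ...] into a kwargs dict with simple type casting."""
    kwargs = {}
    for raw in raw_options:
        key, _, value = raw.partition("=")
        if value in ("True", "False"):
            kwargs[key] = (value == "True")  # bool params, e.g. ebs_optimized
        elif value.isdigit():
            kwargs[key] = int(value)         # integer params, e.g. min_count
        else:
            kwargs[key] = value              # everything else passes through
    return kwargs

opts = parse_extended_options([
    "instance_initiated_shutdown_behavior=terminate",
    "ebs_optimized=True",
])
# opts could then be splatted into the boto call, e.g. image.run(**opts)
```

This covers the {{bool}}/{{string}} cases mentioned in the comment above; richer parameter types like lists of {{NetworkInterfaceSpecification}} would still need something heavier, such as the stripped-down JSON idea.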
[jira] [Commented] (SPARK-6206) spark-ec2 script reporting SSL error?
[ https://issues.apache.org/jira/browse/SPARK-6206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352481#comment-14352481 ] Nicholas Chammas commented on SPARK-6206: - OK, let us know what you find, [~Joe6521]. In general, please try to validate your issue on the user list or on Stack Overflow before reporting it here, unless you are really sure you've found a problem with Spark (as opposed to your environment). spark-ec2 script reporting SSL error? - Key: SPARK-6206 URL: https://issues.apache.org/jira/browse/SPARK-6206 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0 Reporter: Joe O I have been using the spark-ec2 script for several months with no problems. Recently, when executing a script to launch a cluster I got the following error:
{code}
[Errno 185090050] _ssl.c:344: error:0B084002:x509 certificate routines:X509_load_cert_crl_file:system lib
{code}
Nothing launches, the script exits. I am not sure if something on my machine changed, if this is a problem with EC2's certs, or a problem with Python. It occurs 100% of the time, and has been occurring over at least the last two days.
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. 
Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352489#comment-14352489 ] Nicholas Chammas commented on SPARK-6220: - cc [~joshrosen] and [~shivaram] for feedback. The immediate motivation for this is the work I'm doing on automating spark-perf runs. As part of an automated spark-perf run, I'd like to: * set {{instance_initiated_shutdown_behavior=terminate}} for the non-spot instances launched by spark-ec2 (i.e. the master), so that the cluster can self-terminate without needing outside input * set {{instance_profile_arn}} for the master so that spark-perf results can be uploaded to S3 without having to handle AWS user credentials, via use of IAM profiles Since my use case is specialized, I didn't think it was worth adding top-level options for these EC2 features. So I generalized the idea to support any EC2 option supported by boto. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. 
Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. 
While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 --ec2-instance-option {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... 
--ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. 
Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. 
While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed
[jira] [Updated] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6220: Description: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} was: There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. 
Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.run_instances] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. 
While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I
[jira] [Created] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
Nicholas Chammas created SPARK-6218: --- Summary: Upgrade spark-ec2 from optparse to argparse Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191.
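The migration described above might look roughly like this; the option names echo spark-ec2's command-line interface, but the snippet is illustrative, not the actual patch. On Python 2.6, {{argparse}} would be the dependency downloaded via SPARK-6191 rather than the standard library module.

```python
# Illustrative sketch of the optparse -> argparse migration; not the actual
# spark-ec2 patch. argparse validates and converts types at parse time.
import argparse

parser = argparse.ArgumentParser(prog="spark-ec2")
parser.add_argument("--slaves", type=int, default=1,
                    help="number of slave nodes to launch")
parser.add_argument("--spot-price", type=float, default=None,
                    help="max spot price, in dollars, e.g. 0.25")

args = parser.parse_args(["--slaves", "4", "--spot-price", "0.25"])
# args.slaves is the int 4; args.spot_price is the float 0.25
```

Compared to optparse, bad input such as {{--slaves four}} is rejected with a usage error before any EC2 calls are made.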
[jira] [Commented] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
[ https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352331#comment-14352331 ] Nicholas Chammas commented on SPARK-6218: - [~shivaram], [~joshrosen]: What do you think? Upgrade spark-ec2 from optparse to argparse --- Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191.
[jira] [Updated] (SPARK-6218) Upgrade spark-ec2 from optparse to argparse
[ https://issues.apache.org/jira/browse/SPARK-6218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6218: Description: spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. was: spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191.
Upgrade spark-ec2 from optparse to argparse --- Key: SPARK-6218 URL: https://issues.apache.org/jira/browse/SPARK-6218 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor spark-ec2 [currently uses optparse|https://github.com/apache/spark/blob/55b1b32dc8b9b25deea8e5864b53fe802bb92741/ec2/spark_ec2.py#L43]. In Python 2.7, optparse was [deprecated in favor of argparse|https://docs.python.org/2/library/optparse.html]. This is the main motivation for moving away from optparse. Additionally, upgrading to argparse provides some [additional benefits noted in the docs|https://argparse.googlecode.com/svn/trunk/doc/argparse-vs-optparse.html]. The one we are most likely to benefit from is the better input validation. Specifically, being able to cleanly tie each input parameter to a validation method will cut down the input validation code currently spread out across the script. argparse is not included with Python 2.6, which is currently the minimum version of Python we support in Spark, but it can easily be downloaded by spark-ec2 with the work that has already been done in SPARK-6191. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
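The input-validation benefit described above can be sketched in plain argparse. The flag names and the validator below are hypothetical, purely for illustration; they are not the actual spark-ec2 options:

```python
import argparse

def positive_int(value):
    """Validation tied directly to the parameter: reject non-positive counts."""
    n = int(value)
    if n <= 0:
        raise argparse.ArgumentTypeError("%r is not a positive integer" % value)
    return n

parser = argparse.ArgumentParser(prog="spark-ec2")
# Hypothetical flags, for illustration only.
parser.add_argument("--slaves", type=positive_int, default=1)
parser.add_argument("--instance-type", default="m3.medium")

args = parser.parse_args(["--slaves", "4"])
print(args.slaves)
```

With optparse, the equivalent check typically lives as ad-hoc `if` statements after parsing; argparse lets the parser itself reject bad input with a clean usage error.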
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14352524#comment-14352524 ] Nicholas Chammas commented on SPARK-6220: - As far as places where we create instances, yes, those are the 2 calls we use. Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass EC2 options through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ... 
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
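Collecting a repeatable {{key=value}} option and turning it into keyword arguments for the underlying boto call could look roughly like this. This is a sketch of one possible syntax, not the actual spark-ec2 implementation:

```python
import argparse

def parse_kv(option):
    """Split a single 'key=value' string into a (key, value) pair."""
    key, sep, value = option.partition("=")
    if not key or not sep:
        raise argparse.ArgumentTypeError("expected key=value, got %r" % option)
    return key, value

parser = argparse.ArgumentParser()
# 'append' lets the flag be specified multiple times, like ssh's -o.
parser.add_argument("--ec2-instance-option", action="append", type=parse_kv,
                    default=[], dest="instance_options")

args = parser.parse_args([
    "--ec2-instance-option", "instance_initiated_shutdown_behavior=terminate",
    "--ec2-instance-option", "ebs_optimized=True",
])
# kwargs that would then be splatted into the underlying boto run() call,
# e.g. image.run(..., **extra_kwargs).
extra_kwargs = dict(args.instance_options)
print(extra_kwargs)
```

Note that everything arrives as strings, so values like {{True}} would still need type coercion before reaching boto; that detail is glossed over here.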
[jira] [Created] (SPARK-6219) Expand Python lint checks to check for compilation errors
Nicholas Chammas created SPARK-6219: --- Summary: Expand Python lint checks to check for compilation errors Key: SPARK-6219 URL: https://issues.apache.org/jira/browse/SPARK-6219 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor An easy lint check for Python would be to make sure the stuff at least compiles. That will catch only the most egregious errors, but it should help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
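The proposed check amounts to compiling each Python file and failing on syntax errors. A minimal sketch of that step using the standard library (file names here are made up for the demo):

```python
import os
import py_compile
import tempfile

# Write out a deliberately broken source file to check against.
source = "def f(:\n    pass\n"
path = os.path.join(tempfile.mkdtemp(), "bad.py")
with open(path, "w") as f:
    f.write(source)

try:
    # doraise=True turns compile failures into exceptions we can catch.
    py_compile.compile(path, doraise=True)
    ok = True
except py_compile.PyCompileError:
    ok = False
print(ok)
```

In a lint script this would loop over the repo's `.py` files (or just shell out to `python -m compileall -q`) and exit non-zero if any file fails.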
[jira] [Updated] (SPARK-6191) Generalize spark-ec2's ability to download libraries from PyPI
[ https://issues.apache.org/jira/browse/SPARK-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6191: Description: Right now we have a method to specifically download boto. Let's generalize it so it's easy to download additional libraries if we want. Likely use cases: * Downloading PyYAML for was:Right now we have a method to specifically download boto. Let's generalize it so it's easy to download additional libraries if we want. Generalize spark-ec2's ability to download libraries from PyPI -- Key: SPARK-6191 URL: https://issues.apache.org/jira/browse/SPARK-6191 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Right now we have a method to specifically download boto. Let's generalize it so it's easy to download additional libraries if we want. Likely use cases: * Downloading PyYAML for -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6191) Generalize spark-ec2's ability to download libraries from PyPI
Nicholas Chammas created SPARK-6191: --- Summary: Generalize spark-ec2's ability to download libraries from PyPI Key: SPARK-6191 URL: https://issues.apache.org/jira/browse/SPARK-6191 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Right now we have a method to specifically download boto. Let's generalize it so it's easy to download additional libraries if we want. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6191) Generalize spark-ec2's ability to download libraries from PyPI
[ https://issues.apache.org/jira/browse/SPARK-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-6191: Description: Right now we have a method to specifically download boto. Let's generalize it so it's easy to download additional libraries if we want. Likely use cases: * Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML file. (SPARK-925) * Downloading argparse to clean up / modernize our option parsing. was: Right now we have a method to specifically download boto. Let's generalize it so it's easy to download additional libraries if we want. Likely use cases: * Downloading PyYAML for Generalize spark-ec2's ability to download libraries from PyPI -- Key: SPARK-6191 URL: https://issues.apache.org/jira/browse/SPARK-6191 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor Right now we have a method to specifically download boto. Let's generalize it so it's easy to download additional libraries if we want. Likely use cases: * Downloading PyYAML to allow spark-ec2 configs to be persisted as a YAML file. (SPARK-925) * Downloading argparse to clean up / modernize our option parsing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14349577#comment-14349577 ] Nicholas Chammas commented on SPARK-3369: - {quote} How about breaking backward compatibility {quote} The Spark project has made a big deal out of promising API stability. People trust that they can upgrade their version of Spark without breaking any of their code. Breaking this promise would shake users' trust in the project. That's a big deal. Overall, it's not worth whatever benefit we hope to get out of fixing this issue. This issue is tagged for 2+ and that seems to be the correct thing to do. Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2, 1.2.1 Reporter: Sean Owen Assignee: Sean Owen Labels: breaking_change Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was inadvertent inconsistency, then the big issue here is that of course this is part of a public API. 
Workarounds I can think of: Promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}. Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the desired signature, and deprecate existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
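The first workaround (an Iterable that hands back one already-existing Iterator) has a simple Python analogue, shown purely for illustration since the actual issue concerns the Java API. It captures both the benefit (no copying into memory) and the hazard (only safe if iterated once):

```python
class OneShotIterable:
    """Wrap an existing iterator as an iterable that is valid for one pass.

    Mirrors the hacky 'IteratorIterable' idea: avoids materializing the
    input, but breaks the Iterable contract if iterated more than once.
    """
    def __init__(self, it):
        self._it = iter(it)
        self._consumed = False

    def __iter__(self):
        if self._consumed:
            raise RuntimeError("this iterable supports a single pass only")
        self._consumed = True
        return self._it

wrapped = OneShotIterable(x * x for x in range(4))
print(list(wrapped))  # → [0, 1, 4, 9]
```

The promise "Spark will only call iterator() once" is exactly what makes the single-pass restriction tolerable; any caller that iterates twice gets an error rather than silently empty results.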
[jira] [Created] (SPARK-6193) Speed up how spark-ec2 searches for clusters
Nicholas Chammas created SPARK-6193: --- Summary: Speed up how spark-ec2 searches for clusters Key: SPARK-6193 URL: https://issues.apache.org/jira/browse/SPARK-6193 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor {{spark-ec2}} currently pulls down [info for all instances|https://github.com/apache/spark/blob/eb48fd6e9d55fb034c00e61374bb9c2a86a82fb8/ec2/spark_ec2.py#L620] and searches locally for the target cluster. Instead, it should push those filters up when querying EC2. For AWS accounts with hundreds of active instances, there is a difference of many seconds between the two. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
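Pushing the search server-side means building an EC2 API filter instead of scanning all reservations locally. A sketch of what that filter could look like; the security-group naming convention below follows spark-ec2's {{<cluster>-master}} / {{<cluster>-slaves}} groups, but treat the exact filter keys as an assumption rather than the final implementation:

```python
def cluster_filters(cluster_name):
    """Build EC2 API filters for one cluster's instances, so that filtering
    happens on the EC2 side instead of pulling down every instance.

    Assumes spark-ec2's convention of naming security groups
    '<cluster>-master' and '<cluster>-slaves'.
    """
    return {
        "instance.group-name": [cluster_name + "-master",
                                cluster_name + "-slaves"],
        "instance-state-name": ["pending", "running"],
    }

# With boto 2 this dict would be passed straight through, e.g.:
#   reservations = conn.get_all_reservations(filters=cluster_filters("spark-test"))
print(cluster_filters("spark-test"))
```

For accounts with hundreds of instances, the server-side filter returns only the handful of matching reservations, which is where the multi-second speedup comes from.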
[jira] [Updated] (SPARK-5473) Expose SSH failures after status checks pass
[ https://issues.apache.org/jira/browse/SPARK-5473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5473: Description: If there is some fatal problem with launching a cluster, `spark-ec2` just hangs without giving the user useful feedback on what the problem is. This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. For example: ``` $ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch spark-test Setting up security groups... Searching for existing cluster spark-test... Spark AMI: ami-35b1885c Launching instances... Launched 1 slaves in us-east-1c, regid = r-7dadd096 Launched master in us-east-1c, regid = r-fcadd017 Waiting for cluster to enter 'ssh-ready' state... Warning: SSH connection error. (This could be temporary.) Host: 127.0.0.1 SSH return code: 255 SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory. Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts. Permission denied (publickey). ``` This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported [here on Stack Overflow](http://stackoverflow.com/q/28002443/) and [here on the user list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3c1422323829398-21381.p...@n3.nabble.com%3E), where the users couldn't tell what the problem was because it was being hidden by `spark-ec2`. This is a usability improvement that should be backported to 1.2. 
Expose SSH failures after status checks pass Key: SPARK-5473 URL: https://issues.apache.org/jira/browse/SPARK-5473 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Priority: Minor Fix For: 1.3.0 If there is some fatal problem with launching a cluster, `spark-ec2` just hangs without giving the user useful feedback on what the problem is. This PR exposes the output of the SSH calls to the user if the SSH test fails during cluster launch for any reason but the instance status checks are all green. For example: ``` $ ./ec2/spark-ec2 -k key -i /incorrect/path/identity.pem --instance-type m3.medium --slaves 1 --zone us-east-1c launch spark-test Setting up security groups... Searching for existing cluster spark-test... Spark AMI: ami-35b1885c Launching instances... Launched 1 slaves in us-east-1c, regid = r-7dadd096 Launched master in us-east-1c, regid = r-fcadd017 Waiting for cluster to enter 'ssh-ready' state... Warning: SSH connection error. (This could be temporary.) Host: 127.0.0.1 SSH return code: 255 SSH output: Warning: Identity file /incorrect/path/identity.pem not accessible: No such file or directory. Warning: Permanently added '127.0.0.1' (RSA) to the list of known hosts. Permission denied (publickey). ``` This should give users enough information when some unrecoverable error occurs during launch so they can know to abort the launch. This will help avoid situations like the ones reported [here on Stack Overflow](http://stackoverflow.com/q/28002443/) and [here on the user list](http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3c1422323829398-21381.p...@n3.nabble.com%3E), where the users couldn't tell what the problem was because it was being hidden by `spark-ec2`. This is a usability improvement that should be backported to 1.2. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
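The mechanism being described reduces to: run the SSH probe, and if it fails while status checks are green, surface the return code and output instead of swallowing them. A minimal sketch (the real spark-ec2 code differs; the failing command below is a stand-in for an actual `ssh` invocation):

```python
import subprocess

def try_ssh(command):
    """Run one SSH attempt; on failure, print the return code and combined
    stdout/stderr so the user can see why the launch is stuck."""
    proc = subprocess.run(command, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT)
    if proc.returncode != 0:
        print("Warning: SSH connection error. (This could be temporary.)")
        print("SSH return code: %d" % proc.returncode)
        print("SSH output: %s" % proc.stdout.decode().strip())
    return proc.returncode

# Stand-in for a failing ssh call; a real one would be something like
# ['ssh', '-i', identity_file, 'user@host', 'true'].
rc = try_ssh(["sh", "-c", "echo 'Permission denied (publickey).' >&2; exit 255"])
```

Because status checks pass even when the identity file is wrong, exposing this output is the only signal the user gets that the failure is unrecoverable.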
[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3533: Target Version/s: 1.4.0 Add saveAsTextFileByKey() method to RDDs Key: SPARK-3533 URL: https://issues.apache.org/jira/browse/SPARK-3533 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Nicholas Chammas Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys. For example, say I have an RDD like this: {code} a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0]) a.collect() [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] a.keys().distinct().collect() ['B', 'F', 'N'] {code} Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition. So the output would look something like: {code} /path/prefix/B [/part-1, /part-2, etc] /path/prefix/F [/part-1, /part-2, etc] /path/prefix/N [/part-1, /part-2, etc] {code} Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark. Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
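The directory-per-key layout requested above amounts to grouping values by key. A plain-Python model of that grouping, with the current PySpark workaround (one filter-and-save pass per distinct key) shown only as a comment since it requires a live SparkContext:

```python
def split_by_key(pairs):
    """Group (key, value) pairs by key, mirroring the one-output-directory-
    per-distinct-key layout that saveAsTextFileByKey() would produce."""
    out = {}
    for k, v in pairs:
        out.setdefault(k, []).append(v)
    return out

records = [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'),
           ('B', 'Ben'), ('F', 'Frankie')]
groups = split_by_key(records)
# The PySpark workaround today is one full pass over the RDD per key:
#   for k in a.keys().distinct().collect():
#       a.filter(lambda kv: kv[0] == k).values().saveAsTextFile(prefix + '/' + k)
print(sorted(groups))
```

The per-key-pass workaround is O(distinct keys) scans of the data, which is exactly why a single-pass {{saveAsTextFileByKey()}} would be valuable.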
Re: spark-ec2 default to Hadoop 2
I might take a look at that pr if we get around to doing some perf testing of Spark on various resource managers. On Mon, Mar 2, 2015 at 12:22 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW there is a PR open to add support for Hadoop 2.4 to spark-ec2 scripts at https://github.com/mesos/spark-ec2/pull/77 -- But it hasn't received much review or testing to be merged. Thanks Shivaram On Sun, Mar 1, 2015 at 11:49 PM, Sean Owen so...@cloudera.com wrote: I agree with that. My anecdotal impression is that Hadoop 1.x usage out there is maybe a couple percent, and so we should shift towards 2.x at least as defaults. On Sun, Mar 1, 2015 at 10:59 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: https://github.com/apache/spark/blob/fd8d283eeb98e310b1e85ef8c3a8af9e547ab5e0/ec2/spark_ec2.py#L162-L164 Is there any reason we shouldn't update the default Hadoop major version in spark-ec2 to 2? Nick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
[jira] [Commented] (SPARK-882) Have link for feedback/suggestions in docs
[ https://issues.apache.org/jira/browse/SPARK-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344475#comment-14344475 ] Nicholas Chammas commented on SPARK-882: Is the intended use here that users could submit corrections easily without having to open a JIRA/PR? I think that's a great idea; it lowers the barrier to providing feedback on a high visibility item like the docs. Couple of questions: 1. Is integration with 3rd party tools like UserVoice or Disqus allowed? Actually, it might be really sweet if some simple, in-page feedback form automatically submitted a JIRA issue with the appropriate tags and info. 2. I assume the docs proper are the priority, right? Do we want to do this for the main site as well? Have link for feedback/suggestions in docs -- Key: SPARK-882 URL: https://issues.apache.org/jira/browse/SPARK-882 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Cogan It would be cool to have a link at the top of the docs for feedback/suggestions/errors. I bet we'd get a lot of interesting stuff from that and it could be a good way to crowdsource correctness checking, since a lot of us that write them never have to use them. Something to the right of the main top nav might be good. [~andyk] [~matei] - what do you guys think? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2545) Add a diagnosis mode for closures to figure out what they're bringing in
[ https://issues.apache.org/jira/browse/SPARK-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344482#comment-14344482 ] Nicholas Chammas commented on SPARK-2545: - [~adav] - Would this potentially also be something to use in the REPL? If I understand correctly, the situation with closures is more complicated there, right? Add a diagnosis mode for closures to figure out what they're bringing in Key: SPARK-2545 URL: https://issues.apache.org/jira/browse/SPARK-2545 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Today, it's pretty hard to figure out why your closure is bigger than expected, because it's not obvious what objects are being included or who is including them. We should have some sort of diagnosis available to users with very large closures that displays the contents of the closure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
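For intuition, a toy version of such a diagnosis in Python: list what a function closes over and how large each captured object serializes to. This is only a sketch of the idea, not Spark's ClosureCleaner or anything Spark actually ships:

```python
import pickle

def closure_report(func):
    """Report the name and pickled size of each variable a function closes
    over -- a toy 'what is my closure dragging in' diagnosis."""
    if func.__closure__ is None:
        return {}
    # __closure__ cells are ordered to match co_freevars, so zip is safe.
    return {name: len(pickle.dumps(cell.cell_contents))
            for name, cell in zip(func.__code__.co_freevars, func.__closure__)}

def make_task():
    big = list(range(100000))   # accidentally captured large object
    small = 7
    def task(x):
        return x + small + len(big)
    return task

report = closure_report(make_task())
print(report)  # the pickled size of 'big' dwarfs that of 'small'
```

A real diagnosis mode would need to handle enclosing objects (e.g. methods capturing {{self}}) and nested closures, which is where most of the surprise bloat comes from.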
[jira] [Commented] (SPARK-2545) Add a diagnosis mode for closures to figure out what they're bringing in
[ https://issues.apache.org/jira/browse/SPARK-2545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344504#comment-14344504 ] Nicholas Chammas commented on SPARK-2545: - cc [~tobias.schlatter] Add a diagnosis mode for closures to figure out what they're bringing in Key: SPARK-2545 URL: https://issues.apache.org/jira/browse/SPARK-2545 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Today, it's pretty hard to figure out why your closure is bigger than expected, because it's not obvious what objects are being included or who is including them. We should have some sort of diagnosis available to users with very large closures that displays the contents of the closure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2095) sc.getExecutorCPUCounts()
[ https://issues.apache.org/jira/browse/SPARK-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14344480#comment-14344480 ] Nicholas Chammas commented on SPARK-2095: - cc [~pwendell], [~joshrosen] This seems like a useful thing to have, though you can accomplish something similar (though not as explicitly) with {{sc.defaultParallelism}}, which defaults to the number of executor cores in your cluster. sc.getExecutorCPUCounts() - Key: SPARK-2095 URL: https://issues.apache.org/jira/browse/SPARK-2095 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Daniel Darabos Priority: Minor We can get the amount of total and free memory (via getExecutorMemoryStatus) and blocks stored (via getExecutorStorageStatus) on the executors. I would also like to be able to query the available CPU per executor. This would be useful in dynamically deciding the number of partitions at the start of an operation. What do you think? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
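As a sketch of the "decide partitions dynamically" use case mentioned above, here is how {{sc.defaultParallelism}} could feed a partition-count heuristic. The slack factor and floor are arbitrary illustrative choices, not Spark behavior:

```python
def suggest_partitions(default_parallelism, slack=3):
    """Toy heuristic: derive a partition count from the cluster's total
    executor cores (what sc.defaultParallelism reports on a cluster),
    with a slack multiplier so stragglers don't idle the other cores.
    Both the multiplier and the floor of 2 are arbitrary assumptions."""
    return max(2, default_parallelism * slack)

# e.g. on a 16-core cluster: rdd.repartition(suggest_partitions(16))
print(suggest_partitions(16))
```

An explicit per-executor CPU count, as requested here, would allow finer decisions (e.g. sizing partitions to individual executors) than this single cluster-wide number permits.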
spark-ec2 default to Hadoop 2
https://github.com/apache/spark/blob/fd8d283eeb98e310b1e85ef8c3a8af9e547ab5e0/ec2/spark_ec2.py#L162-L164 Is there any reason we shouldn't update the default Hadoop major version in spark-ec2 to 2? Nick
[jira] [Commented] (SPARK-6077) Multiple spark streaming tabs on UI when reuse the same sparkcontext
[ https://issues.apache.org/jira/browse/SPARK-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342704#comment-14342704 ] Nicholas Chammas commented on SPARK-6077: - Please disregard the comments on SPARK-2463 and focus on the description. The comments veer off into a separate issue from the one put forward in the description. Multiple spark streaming tabs on UI when reuse the same sparkcontext Key: SPARK-6077 URL: https://issues.apache.org/jira/browse/SPARK-6077 Project: Spark Issue Type: Bug Components: Streaming, Web UI Reporter: zhichao-li Priority: Minor Currently we would create a new streaming tab for each streamingContext even if there's already one on the same sparkContext which would cause duplicate StreamingTab created and none of them is taking effect. snapshot: https://www.dropbox.com/s/t4gd6hqyqo0nivz/bad%20multiple%20streamings.png?dl=0 How to reproduce: 1) import org.apache.spark.SparkConf import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.storage.StorageLevel val ssc = new StreamingContext(sc, Seconds(1)) val lines = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() . 2) ssc.stop(false) val ssc = new StreamingContext(sc, Seconds(1)) val lines = ssc.socketTextStream("localhost", , StorageLevel.MEMORY_AND_DISK_SER) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2463) Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI
[ https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342714#comment-14342714 ] Nicholas Chammas commented on SPARK-2463: - For people reading through these comments, please keep in mind that this issue is describing a problem relating to starting and then stopping a streaming context multiple times. There is only ever 1 context running at a time. *This issue has nothing to do with concurrently running contexts*, at least not directly. Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI -- Key: SPARK-2463 URL: https://issues.apache.org/jira/browse/SPARK-2463 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.0.1 Reporter: Nicholas Chammas Assignee: Josh Rosen Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. *Please note* that the issue of concurrently running contexts discussed in the comments below is a completely separate issue. *This issue has nothing to do with concurrently running contexts.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2463) Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI
[ https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-2463: Description: Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}} (and then stop it, if you want). There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. *Please note* that the issue of concurrently running contexts discussed in the comments below is a completely separate issue. *This issue has nothing to do with concurrently running streaming contexts.* was: Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. *Please note* that the issue of concurrently running contexts discussed in the comments below is a completely separate issue. *This issue has nothing to do with concurrently running contexts.* Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI -- Key: SPARK-2463 URL: https://issues.apache.org/jira/browse/SPARK-2463 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.0.1 Reporter: Nicholas Chammas Assignee: Josh Rosen Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}} (and then stop it, if you want). 
There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. *Please note* that the issue of concurrently running contexts discussed in the comments below is a completely separate issue. *This issue has nothing to do with concurrently running streaming contexts.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2463) Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI
[ https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-2463: Description: Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. *Please note* that the issue of concurrently running contexts discussed in the comments below is a completely separate issue. *This issue has nothing to do with concurrently running contexts.* was: Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. Creating then stopping StreamingContext multiple times from shell generates duplicate Streaming tabs in UI -- Key: SPARK-2463 URL: https://issues.apache.org/jira/browse/SPARK-2463 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.0.1 Reporter: Nicholas Chammas Assignee: Josh Rosen Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. *Please note* that the issue of concurrently running contexts discussed in the comments below is a completely separate issue. 
*This issue has nothing to do with concurrently running contexts.* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6084) spark-shell broken on Windows
[ https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341787#comment-14341787 ] Nicholas Chammas commented on SPARK-6084: - Ah, there's also SPARK-5396, though it's in Russian (?) so I'm not sure if the error is the same. spark-shell broken on Windows - Key: SPARK-6084 URL: https://issues.apache.org/jira/browse/SPARK-6084 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0, 1.2.1 Environment: Windows 7, Scala 2.11.4, Java 1.8 Reporter: Nicholas Chammas Labels: windows Original report here: http://stackoverflow.com/questions/28747795/spark-launch-find-version For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this: {code} bin\spark-shell.cmd {code} Yields the following error: {code} find: 'version': No such file or directory else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341789#comment-14341789 ] Nicholas Chammas commented on SPARK-5389: - Yeah, I think we found another instance of this in SPARK-6084 / [here|http://stackoverflow.com/questions/28747795/spark-launch-find-version]. spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Reporter: Yana Kadiyska Priority: Trivial Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial sine calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: spark-1.2.0-bin-cdh4bin\spark-shell.cmd else was unexpected at this time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6084) spark-shell broken on Windows
[ https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas reopened SPARK-6084: - Don't see how this is a dup of SPARK-4833. spark-shell broken on Windows - Key: SPARK-6084 URL: https://issues.apache.org/jira/browse/SPARK-6084 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0, 1.2.1 Environment: Windows 7, Scala 2.11.4, Java 1.8 Reporter: Nicholas Chammas Labels: windows Original report here: http://stackoverflow.com/questions/28747795/spark-launch-find-version For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this: {code} bin\spark-shell.cmd {code} Yields the following error: {code} find: 'version': No such file or directory else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6084) spark-shell broken on Windows
[ https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-6084. - Resolution: Duplicate Resolving as duplicate of SPARK-5389. That seems a more likely match for this than SPARK-4833. spark-shell broken on Windows - Key: SPARK-6084 URL: https://issues.apache.org/jira/browse/SPARK-6084 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0, 1.2.1 Environment: Windows 7, Scala 2.11.4, Java 1.8 Reporter: Nicholas Chammas Labels: windows Original report here: http://stackoverflow.com/questions/28747795/spark-launch-find-version For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this: {code} bin\spark-shell.cmd {code} Yields the following error: {code} find: 'version': No such file or directory else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5396) Syntax error in spark scripts on windows.
[ https://issues.apache.org/jira/browse/SPARK-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341788#comment-14341788 ] Nicholas Chammas commented on SPARK-5396: - What does that error message say in English? So we can pattern match to similar reports elsewhere. Syntax error in spark scripts on windows. - Key: SPARK-5396 URL: https://issues.apache.org/jira/browse/SPARK-5396 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0 Environment: Window 7 and Window 8.1. Reporter: Vladimir Protsenko Assignee: Masayoshi TSUZUKI Priority: Critical Fix For: 1.3.0 Attachments: windows7.png, windows8.1.png I made the following steps: 1. downloaded and installed Scala 2.11.5 2. downloaded spark 1.2.0 by git clone git://github.com/apache/spark.git 3. run dev/change-version-to-2.11.sh and mvn -Dscala-2.11 -DskipTests clean package (in git bash) After installation tried to run spark-shell.cmd in cmd shell and it says there is a syntax error in file. The same with spark-shell2.cmd, spark-submit.cmd and spark-submit2.cmd. !windows7.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5389: Description: spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4bin\spark-shell.cmd else was unexpected at this time. {code} was: spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial sine calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: spark-1.2.0-bin-cdh4bin\spark-shell.cmd else was unexpected at this time. Priority: Major (was: Trivial) Environment: Windows 7 Marking as major since the shell is technically broken. (Trivial is for mostly cosmetic problems.) Reopening since multiple reports of this problem have come in. spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas reopened SPARK-5389: - spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341790#comment-14341790 ] Nicholas Chammas edited comment on SPARK-5389 at 2/28/15 9:48 PM: -- Marking as major since the shell -is technically broken- is behaving terribly when Java cannot be found. Reopening since multiple reports of this problem have come in. was (Author: nchammas): Marking as major since the shell is technically broken. (Trivial is for mostly cosmetic problems.) Reopening since multiple reports of this problem have come in. spark-shell.cmd does not run from DOS Windows 7 --- Key: SPARK-5389 URL: https://issues.apache.org/jira/browse/SPARK-5389 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0 Environment: Windows 7 Reporter: Yana Kadiyska Attachments: SparkShell_Win7.JPG spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 Marking as trivial since calling spark-shell2.cmd also works fine Attaching a screenshot since the error isn't very useful: {code} spark-1.2.0-bin-cdh4bin\spark-shell.cmd else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6084) spark-shell broken on Windows
[ https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341776#comment-14341776 ] Nicholas Chammas commented on SPARK-6084: - I took a look at the linked issue (SPARK-4833) and I don't see how they are duplicates. They both relate to spark-shell and Windows, but the error messages and conditions are different. Here the user is claiming spark-shell fails with an error right away. There, the user is claiming spark-shell runs OK the first time, but then doesn't run a second time. spark-shell broken on Windows - Key: SPARK-6084 URL: https://issues.apache.org/jira/browse/SPARK-6084 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0, 1.2.1 Environment: Windows 7, Scala 2.11.4, Java 1.8 Reporter: Nicholas Chammas Labels: windows Original report here: http://stackoverflow.com/questions/28747795/spark-launch-find-version For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this: {code} bin\spark-shell.cmd {code} Yields the following error: {code} find: 'version': No such file or directory else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6084) spark-shell broken on Windows
Nicholas Chammas created SPARK-6084: --- Summary: spark-shell broken on Windows Key: SPARK-6084 URL: https://issues.apache.org/jira/browse/SPARK-6084 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.1, 1.2.0 Environment: Windows 7, Scala 2.11.4, Java 1.8 Reporter: Nicholas Chammas Original report here: http://stackoverflow.com/questions/28747795/spark-launch-find-version For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this: {code} bin\spark-shell.cmd {code} Yields the following error: {code} find: 'version': No such file or directory else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6084) spark-shell broken on Windows
[ https://issues.apache.org/jira/browse/SPARK-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14341746#comment-14341746 ] Nicholas Chammas commented on SPARK-6084: - cc [~pwendell], [~andrewor14] I haven't confirmed this issue myself. Just forwarding along the report I saw on Stack Overflow. spark-shell broken on Windows - Key: SPARK-6084 URL: https://issues.apache.org/jira/browse/SPARK-6084 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.0, 1.2.1 Environment: Windows 7, Scala 2.11.4, Java 1.8 Reporter: Nicholas Chammas Labels: windows Original report here: http://stackoverflow.com/questions/28747795/spark-launch-find-version For both spark-1.2.0-bin-hadoop2.4 and spark-1.2.1-bin-hadoop2.4, doing this: {code} bin\spark-shell.cmd {code} Yields the following error: {code} find: 'version': No such file or directory else was unexpected at this time. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5971) Add Mesos support to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5971: Description: Right now, spark-ec2 can only launch Spark clusters that use the standalone manager. Adding support for Mesos would be useful mostly for automated performance testing of Spark on Mesos. was: Right now, spark-ec2 can only launching Spark clusters that use the standalone manager. Adding support to launch Spark-on-Mesos clusters would be useful mostly for automated performance testing of Spark on Mesos. Add Mesos support to spark-ec2 -- Key: SPARK-5971 URL: https://issues.apache.org/jira/browse/SPARK-5971 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Nicholas Chammas Priority: Minor Right now, spark-ec2 can only launch Spark clusters that use the standalone manager. Adding support for Mesos would be useful mostly for automated performance testing of Spark on Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335074#comment-14335074 ] Nicholas Chammas commented on SPARK-3850: - Ah I see. I'm fine with closing this issue if that's the case. I opened it mostly because of the linked discussions. But actually wouldn't this check also cover those data files? Scala style: disallow trailing spaces - Key: SPARK-3850 URL: https://issues.apache.org/jira/browse/SPARK-3850 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Nicholas Chammas Priority: Minor Background discussions: * https://github.com/apache/spark/pull/2619 * http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html If you look at [the PR Cheng opened|https://github.com/apache/spark/pull/2619], you'll see a trailing white space seemed to mess up some SQL test. That's what spurred the creation of this issue. [Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using this [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3850: Description: Background discussions: * https://github.com/apache/spark/pull/2619 * http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html If you look at [the PR Cheng opened|https://github.com/apache/spark/pull/2619], you'll see a trailing white space seemed to mess up some SQL test. That's what spurred the creation of this issue. [Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using this [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html]. was:[Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using {{WhitespaceEndOfLineChecker}} here: http://www.scalastyle.org/rules-0.1.0.html Priority: Minor (was: Major) Scala style: disallow trailing spaces - Key: SPARK-3850 URL: https://issues.apache.org/jira/browse/SPARK-3850 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Nicholas Chammas Priority: Minor Background discussions: * https://github.com/apache/spark/pull/2619 * http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html If you look at [the PR Cheng opened|https://github.com/apache/spark/pull/2619], you'll see a trailing white space seemed to mess up some SQL test. That's what spurred the creation of this issue. [Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using this [{{WhitespaceEndOfLineChecker}}|http://www.scalastyle.org/rules-0.1.0.html]. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335045#comment-14335045 ] Nicholas Chammas commented on SPARK-3850: - I guess the root is the [Style Guide|https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide]. This and the parent issue are simply meant for automating existing rules, not introducing new ones. As an aside, this particular rule doesn't seem to be mentioned in the style guide, but it was discussed in a couple of places: * https://github.com/apache/spark/pull/2619 * http://apache-spark-developers-list.1001551.n3.nabble.com/Extending-Scala-style-checks-td8624.html If you look at the PR [~lian cheng] opened, you'll see a trailing white space seemed to mess up some SQL test. Scala style: disallow trailing spaces - Key: SPARK-3850 URL: https://issues.apache.org/jira/browse/SPARK-3850 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Nicholas Chammas [Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using {{WhitespaceEndOfLineChecker}} here: http://www.scalastyle.org/rules-0.1.0.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
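As a sketch of the "check the whole repo but only fail on lines touched by the patch" approach discussed for the parent issue, the rule itself is easy to express. The function names below are hypothetical illustrations in Python, not part of Scalastyle or Spark's actual tooling:

```python
def trailing_whitespace_violations(lines):
    """Return 1-based numbers of lines that end with spaces or tabs."""
    return [i for i, line in enumerate(lines, start=1)
            if line != line.rstrip(" \t")]


def violations_in_diff(lines, changed_lines):
    """Only flag violations on lines the patch touched (from the git diff)."""
    changed = set(changed_lines)
    return [i for i in trailing_whitespace_violations(lines) if i in changed]


source = ["val x = 1", "val y = 2  ", "val z = 3\t"]
print(trailing_whitespace_violations(source))            # [2, 3]
print(violations_in_diff(source, changed_lines=[3]))     # [3] — only line 3 fails the patch
```

Under this scheme, pre-existing violations on untouched lines are reported but do not fail the Jenkins check.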
[jira] [Updated] (SPARK-5971) Add Mesos support to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5971: Summary: Add Mesos support to spark-ec2 (was: Add support for launching Spark-on-Mesos clusters to spark-ec2 ) Add Mesos support to spark-ec2 -- Key: SPARK-5971 URL: https://issues.apache.org/jira/browse/SPARK-5971 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Nicholas Chammas Priority: Minor Right now, spark-ec2 can only launching Spark clusters that use the standalone manager. Adding support to launch Spark-on-Mesos clusters would be useful mostly for automated performance testing of Spark on Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5971) Add support for launching Spark-on-Mesos clusters to spark-ec2
Nicholas Chammas created SPARK-5971: --- Summary: Add support for launching Spark-on-Mesos clusters to spark-ec2 Key: SPARK-5971 URL: https://issues.apache.org/jira/browse/SPARK-5971 Project: Spark Issue Type: New Feature Components: EC2 Reporter: Nicholas Chammas Priority: Minor Right now, spark-ec2 can only launching Spark clusters that use the standalone manager. Adding support to launch Spark-on-Mesos clusters would be useful mostly for automated performance testing of Spark on Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3674) Add support for launching YARN clusters in spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335199#comment-14335199 ] Nicholas Chammas commented on SPARK-3674: - There is an open PR for this here: https://github.com/mesos/spark-ec2/pull/77 Add support for launching YARN clusters in spark-ec2 Key: SPARK-3674 URL: https://issues.apache.org/jira/browse/SPARK-3674 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Right now spark-ec2 only supports launching Spark Standalone clusters. While this is sufficient for basic usage it is hard to test features or do performance benchmarking on YARN. It will be good to add support for installing, configuring an Apache YARN cluster at a fixed version -- say the latest stable version 2.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14335573#comment-14335573 ] Nicholas Chammas commented on SPARK-5312: - It's something to consider I guess. Spark provides strong guarantees about API stability and the like. Making it easy for reviewers to catch changes to public classes is supposed to help with that. What we have is perhaps good for now, and perhaps the foreseeable future. So maybe we should resolve this issue for now and just keep it in mind in the future. cc [~pwendell] Use sbt to detect new or changed public classes in PRs -- Key: SPARK-5312 URL: https://issues.apache.org/jira/browse/SPARK-5312 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Nicholas Chammas Priority: Minor We currently use an [unwieldy grep/sed contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] to detect new public classes in PRs. Apparently, sbt lets you get a list of public classes [much more directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via {{show compile:discoveredMainClasses}}. We should use that instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
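The output of {{show compile:discoveredMainClasses}} is just a list of fully qualified class names, so the comparison step the grep/sed contraption performs could reduce to a set diff. A minimal sketch (Python for illustration; `org.apache.spark.NewPublicApi` is a made-up class name, not a real Spark class):

```python
def diff_public_classes(master_classes, pr_classes):
    """Compare two lists of class names and report what the PR adds/removes."""
    master, pr = set(master_classes), set(pr_classes)
    return {"added": sorted(pr - master), "removed": sorted(master - pr)}


report = diff_public_classes(
    ["org.apache.spark.SparkContext", "org.apache.spark.rdd.RDD"],
    ["org.apache.spark.SparkContext", "org.apache.spark.rdd.RDD",
     "org.apache.spark.NewPublicApi"],  # hypothetical new public class
)
print(report["added"])    # ['org.apache.spark.NewPublicApi']
print(report["removed"])  # []
```

The "added" list is what a reviewer would want surfaced in the PR comment posted by Jenkins.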
Re: Posting to the list
Nabble is a third-party site. If you send stuff through Nabble, Nabble has to forward it along to the Apache mailing list. If something goes wrong with that, you will have a message show up on Nabble that no-one saw. The reverse can also happen, where something actually goes out on the list and doesn't make it to Nabble. Nabble is a nicer, third-party interface to the Apache list archives. No more. It works best for reading through old threads. Apache is the source of truth. Post through there. Unfortunately, this is what we're stuck with. For a related discussion, see this thread about Discourse http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-td20851.html . Nick On Sun Feb 22 2015 at 8:07:08 PM haihar nahak harihar1...@gmail.com wrote: I checked it but I didn't see any mail from user list. Let me do it one more time. [image: Inline image 1] --Harihar On Mon, Feb 23, 2015 at 11:50 AM, Ted Yu yuzhih...@gmail.com wrote: bq. i didnt get any new subscription mail in my inbox. Have you checked your Spam folder ? Cheers On Sun, Feb 22, 2015 at 2:36 PM, hnahak harihar1...@gmail.com wrote: I'm also facing the same issue, this is third time whenever I post anything it never accept by the community and at the same time got a failure mail in my register mail id. and when click to subscribe to this mailing list link, i didnt get any new subscription mail in my inbox. Please anyone suggest a best way to subscribed the email ID -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- {{{H2N}}}-(@:
Re: Launching Spark cluster on EC2 with Ubuntu AMI
I know that Spark EC2 scripts are not guaranteed to work with custom AMIs but still, it should work… Nope, it shouldn’t, unfortunately. The Spark base AMIs are custom-built for spark-ec2. No other AMI will work unless it was built with that goal in mind. Using a random AMI from the Amazon marketplace is unlikely to work because there are several tools and packages (e.g. git) that need to be on the AMI. Furthermore, the spark-ec2 scripts all assume a yum-based Linux distribution, so you won’t be able to use Ubuntu (an apt-get-based distro) without some significant changes to the shell scripts used to build the AMI. There is some work ongoing as part of SPARK-3821 https://issues.apache.org/jira/browse/SPARK-3821 to make it easier to generate AMIs that work with spark-ec2. Nick On Sun Feb 22 2015 at 7:42:52 PM Ted Yu yuzhih...@gmail.com wrote: bq. bash: git: command not found Looks like the AMI doesn't have git pre-installed. Cheers On Sun, Feb 22, 2015 at 4:29 PM, olegshirokikh o...@solver.com wrote: I'm trying to launch Spark cluster on AWS EC2 with custom AMI (Ubuntu) using the following: ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem' --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch spark-ubuntu-cluster Everything starts OK and instances are launched: Found 1 master(s), 2 slaves Waiting for all instances in cluster to enter 'ssh-ready' state. Generating cluster's SSH key on master. But then I'm getting the following SSH errors until it stops trying and quits: bash: git: command not found Connection to ***.us-west-2.compute.amazonaws.com closed. 
Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o', 'UserKnownHostsFile=/dev/null', '-t', '-t', u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 && git clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero exit status 127 I know that Spark EC2 scripts are not guaranteed to work with custom AMIs but still, it should work... Any advice would be greatly appreciated! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
[jira] [Commented] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing
[ https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14333496#comment-14333496 ] Nicholas Chammas commented on SPARK-5944: - I'm not sure, but I think [here in the root POM|https://github.com/apache/spark/blob/242d49584c6aa21d928db2552033661950f760a5/pom.xml#L29] is where you can programmatically fetch the release version. (cc [~srowen] for verification) Also, we should update the [release checklist|https://cwiki.apache.org/confluence/display/SPARK/Preparing+Spark+Releases#PreparingSparkReleases-PreparingSparkforRelease] so this isn't missed again. Maybe this is something that goes in [this audit script|https://github.com/apache/spark/blob/master/dev/audit-release/audit_release.py]? (cc [~pwendell]) Python release docs say SNAPSHOT + Author is missing Key: SPARK-5944 URL: https://issues.apache.org/jira/browse/SPARK-5944 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.1 Reporter: Nicholas Chammas Priority: Minor http://spark.apache.org/docs/latest/api/python/index.html As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 1.2.1. Furthermore, in the footer it says Copyright 2014, Author. It should probably say something else or be removed altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
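Fetching the version programmatically from the root POM could be a namespace-aware XML lookup. A sketch under stated assumptions: the `POM` string below is a trimmed stand-in for Spark's real pom.xml, not its actual contents, and this is not the audit script's existing approach:

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for Spark's root pom.xml; only the fields needed here.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <artifactId>spark-parent</artifactId>
  <version>1.2.1</version>
</project>"""


def release_version(pom_xml):
    """Read <version> from a Maven POM, honoring the POM XML namespace."""
    ns = {"m": "http://maven.apache.org/POM/4.0.0"}
    return ET.fromstring(pom_xml).find("m:version", ns).text


print(release_version(POM))  # 1.2.1
```

A release-audit check could then assert the fetched version does not end in "-SNAPSHOT" before docs are published.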
[jira] [Updated] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing
[ https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5944: Target Version/s: 1.2.2 Python release docs say SNAPSHOT + Author is missing Key: SPARK-5944 URL: https://issues.apache.org/jira/browse/SPARK-5944 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.1 Reporter: Nicholas Chammas Priority: Minor http://spark.apache.org/docs/latest/api/python/index.html As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 1.2.1. Furthermore, in the footer it says Copyright 2014, Author. It should probably say something else or be removed altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 - 2.7
The first concern for Spark will probably be to ensure that we still build and test against Python 2.6, since that's the minimum version of Python we support. Otherwise this seems OK. We use numpy and other Python packages in PySpark, but I don't think we're pinned to any particular version of those packages. Nick On Mon Feb 23 2015 at 2:15:19 PM shane knapp skn...@berkeley.edu wrote: good morning, developers! TL;DR: i will be installing anaconda and setting it in the system PATH so that your python will default to 2.7, as well as it taking over management of all of the sci-py packages. this is potentially a big change, so i'll be testing locally on my staging instance before deployment to the wide world. deployment is *tentatively* next monday, march 2nd. a little background: the jenkins test infra is currently (and happily) managed by a set of tools that allow me to set up and deploy new workers, manage their packages and make sure that all spark and research projects can happily and successfully build. we're currently at the state where ~50 or so packages are installed and configured on each worker. this is getting a little cumbersome, as the package-to-build dep tree is getting pretty large. the biggest offender is the science-based python infrastructure. everything is blindly installed w/yum and pip, so it's hard to control *exactly* what version of any given library is as compared to what's on a dev's laptop. the solution: anaconda (https://store.continuum.io/cshop/anaconda/)! everything is centralized! i can manage specific versions much easier! what this means to you: * python 2.7 will be the default system python. * 2.6 will still be installed and available (/usr/bin/python or /usr/bin/python/2.6) what you need to do: * install anaconda, have it update your PATH * build locally and try to fix any bugs (for spark, this should just work) * if you have problems, reach out to me and i'll see what i can do to help. 
if we can't get your stuff running under python2.7, we can default to 2.6 via a job config change.

what i will be doing:
* setting up anaconda on my staging instance and spot-testing a lot of builds before deployment

please let me know if there are any issues/concerns... i'll be posting updates this week and will let everyone know if there are any changes to the Plan[tm].

your friendly devops engineer,
shane
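The compatibility constraint discussed above (default interpreter moves to 2.7, but Spark must still run under 2.6) can be expressed as a small guard. A minimal sketch in Python; the `MIN_PYTHON` constant and function name are illustrative, not part of Spark's actual build scripts:

```python
import sys

# Minimum Python version Spark supported at the time, per the thread above.
MIN_PYTHON = (2, 6)

def meets_min_python(version_info=None):
    """Return True if the given interpreter version meets the minimum.

    Jenkins workers default to Anaconda's 2.7 after the change, but a job
    can still be pointed back at 2.6 via a config change, so both must pass.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON
```

A build script could call this at startup and bail out early with a clear message instead of failing later with an obscure syntax error.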
[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334352#comment-14334352 ] Nicholas Chammas commented on SPARK-4123: - Go ahead! I haven't done anything for this yet. Show new dependencies added in pull requests Key: SPARK-4123 URL: https://issues.apache.org/jira/browse/SPARK-4123 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Priority: Critical We should inspect the classpath of Spark's assembly jar for every pull request. This only takes a few seconds in Maven and it will help weed out dependency changes from the master branch. Ideally we'd post any dependency changes in the pull request message.
{code}
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : '\n' | awk -F/ '{print $NF}' | sort > my-classpath
$ git checkout apache/master
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : '\n' | awk -F/ '{print $NF}' | sort > master-classpath
$ diff my-classpath master-classpath
< chill-java-0.3.6.jar
< chill_2.10-0.3.6.jar
---
> chill-java-0.5.0.jar
> chill_2.10-0.5.0.jar
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
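Once the two classpath listings are in hand, the comparison itself is a simple set difference that could be folded into the PR tooling. A minimal sketch; the function name is hypothetical, not part of any Spark script:

```python
def classpath_changes(master_jars, pr_jars):
    """Compare two classpath listings (one jar filename per entry) and
    return (added, removed) relative to the master branch."""
    master_set, pr_set = set(master_jars), set(pr_jars)
    added = sorted(pr_set - master_set)
    removed = sorted(master_set - pr_set)
    return added, removed
```

The two return lists map directly onto the `<`/`>` sections of the `diff` output above, and could be formatted into a pull request comment.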
[jira] [Comment Edited] (SPARK-3850) Scala style: disallow trailing spaces
[ https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334351#comment-14334351 ] Nicholas Chammas edited comment on SPARK-3850 at 2/24/15 5:16 AM: -- {quote} enabled=false {quote} Per the parent issue SPARK-3849, I believe this issue is about enabling this rule in a non-intrusive way. So I think we still need this issue. was (Author: nchammas): {quote} enabled=false {quote} Per the parent issue SPARK-3849, I believe this issue about enabling this rule in a non-intrusive way. So I think we still need this issue. Scala style: disallow trailing spaces - Key: SPARK-3850 URL: https://issues.apache.org/jira/browse/SPARK-3850 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: Nicholas Chammas [Ted Yu on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E] suggested using {{WhitespaceEndOfLineChecker}} here: http://www.scalastyle.org/rules-0.1.0.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
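The "non-intrusive" enforcement SPARK-3849/3850 describe — check the whole repo, but fail the patch only when a violation lands on a line the patch touched — can be sketched for the trailing-whitespace rule as follows. This is a hypothetical illustration, not Spark's or Scalastyle's actual implementation:

```python
import re

TRAILING_WS = re.compile(r"[ \t]+$")

def trailing_ws_violations(lines):
    """Return the set of 1-based line numbers ending in whitespace."""
    return {i for i, line in enumerate(lines, start=1)
            if TRAILING_WS.search(line)}

def violations_in_patch(lines, changed_line_numbers):
    """Fail only for violations on lines the patch changed (per the git
    diff), so the rule can be enabled repo-wide without requiring one
    sweeping cleanup commit."""
    return sorted(trailing_ws_violations(lines) & set(changed_line_numbers))
```

This works for any rule whose failures are tied to specific lines (line too long, trailing spaces); as the parent issue notes, rules without line positions can't be scoped this way.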
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14334355#comment-14334355 ] Nicholas Chammas commented on SPARK-5312: - Yeah, this is not a priority really. I looked into sbt and agree it's probably not suited to the task. I found something else that looks interesting: http://software.clapper.org/classutil/ But I don't have time to look into it. Use sbt to detect new or changed public classes in PRs -- Key: SPARK-5312 URL: https://issues.apache.org/jira/browse/SPARK-5312 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Nicholas Chammas Priority: Minor We currently use an [unwieldy grep/sed contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] to detect new public classes in PRs. Apparently, sbt lets you get a list of public classes [much more directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via {{show compile:discoveredMainClasses}}. We should use that instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
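Whatever tool produces the class listings (sbt, classutil, or the existing grep/sed contraption), the reporting step reduces to diffing two sets of fully-qualified names. A sketch of what that report might look like, grouped by package for readability; the function name and grouping are assumptions, not anything in Spark's `run-tests-jenkins`:

```python
from collections import defaultdict

def new_public_classes_by_package(master_classes, pr_classes):
    """Group the public classes a PR introduces by package, for a
    readable PR comment."""
    added = set(pr_classes) - set(master_classes)
    by_package = defaultdict(list)
    for fqcn in added:
        package, _, name = fqcn.rpartition(".")
        by_package[package].append(name)
    return {pkg: sorted(names) for pkg, names in by_package.items()}
```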
[jira] [Resolved] (SPARK-4958) Bake common tools like ganglia into Spark AMI
[ https://issues.apache.org/jira/browse/SPARK-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-4958. - Resolution: Duplicate Fix Version/s: (was: 1.3.0) Closing this as a duplicate of SPARK-3821 since we're covering the addition of stuff like Ganglia to the AMIs in that issue. Bake common tools like ganglia into Spark AMI - Key: SPARK-4958 URL: https://issues.apache.org/jira/browse/SPARK-4958 Project: Spark Issue Type: Sub-task Components: EC2 Reporter: Nicholas Chammas Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Improving metadata in Spark JIRA
Open pull request count is down to 254 right now from ~325 several weeks ago. This is great. Ideally, we need to get this down to 50 and keep it there. Having so many open pull requests is just a bad signal to contributors. But it will take some time to get there.

- 1+ Component

Sean, do you have permission to edit our JIRA settings? It should be possible to enforce this in JIRA itself.

- 1+ Affects version

I don't think this field makes sense for improvements, right?

Nick

On Sun Feb 22 2015 at 9:43:24 AM Sean Owen so...@cloudera.com wrote:

Open pull request count is down to 254 right now from ~325 several weeks ago. Open JIRA count is down slightly to 1262 from a peak over ~1320. Obviously, in the face of an ever faster and larger stream of contributions. There's a real positive impact of JIRA being a little more meaningful, a little less backlog to keep looking at, getting commits in slightly faster, slightly happier contributors, etc. The virtuous circle can keep going. It'd be great if every contributor could take a moment to look at his or her open PRs and JIRAs.
Example searches (replace with your user name / name):
https://github.com/apache/spark/pulls/srowen
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20reporter%20%3D%20%22Sean%20Owen%22%20or%20assignee%20%3D%20%22Sean%20Owen%22

For PRs:
- if it appears to be waiting on your action or feedback,
  - push more changes and/or reply to comments, or
  - if it isn't work you can pursue in the immediate future, close the PR
- if it appears to be waiting on others,
  - if it's had feedback and it's unclear whether there's support to commit as-is,
    - break down or reduce the change to something less controversial
    - close the PR as softly rejected
  - if there's no feedback or plainly waiting for action, ping @them

For JIRAs:
- If it's fixed along the way, or obsolete, resolve as Fixed or NotAProblem
- Do a quick search to see if a similar issue has been filed and is resolved or has more activity; resolve as Duplicate if so
- Check that fields are assigned reasonably:
  - Meaningful title and description
  - Reasonable type and priority. Not everything is a major bug, and few are blockers
  - 1+ Component
  - 1+ Affects version
  - Avoid setting target version until it looks like there's momentum to merge a resolution
- If the JIRA has had no activity in a long time (6+ months), but does not feel obsolete, try to move it to some resolution:
  - Request feedback, from specific people if desired, to feel out if there is any other support for the change
  - Add more info, like a specific reproduction for bugs
  - Narrow scope of feature requests to something that contains a few actionable steps, instead of broad open-ended wishes
  - Work on a fix. In an ideal world people are willing to work to resolve JIRAs they open, and don't fire-and-forget

If everyone did this, not only would it advance the house-cleaning a bit more, but I'm sure we'd rediscover some important work and issues that need attention.
On Sun, Feb 22, 2015 at 7:54 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As of right now, there are no more open JIRA issues without an assigned component https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20EMPTY%20ORDER%20BY%20updated%20DESC! Hurray! [image: yay] Thanks to Sean and others for the cleanup! Nick
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332303#comment-14332303 ] Nicholas Chammas commented on SPARK-3821: - For those wanting to use the work being done as part of this issue before it gets merged upstream, I posted some [instructions on Stack Overflow|http://stackoverflow.com/a/28639669/877069] in response to a related question. Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Git Achievements
For fun: http://acha-acha.co/#/repo/https://github.com/apache/spark I just added Spark to this site. Some of these “achievements” are hilarious. Leo Tolstoy: More than 10 lines in a commit message Dangerous Game: Commit after 6PM friday Nick
[jira] [Commented] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing
[ https://issues.apache.org/jira/browse/SPARK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332394#comment-14332394 ] Nicholas Chammas commented on SPARK-5944: - cc [~davies], [~joshrosen] Python release docs say SNAPSHOT + Author is missing Key: SPARK-5944 URL: https://issues.apache.org/jira/browse/SPARK-5944 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.1 Reporter: Nicholas Chammas Priority: Minor http://spark.apache.org/docs/latest/api/python/index.html As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 1.2.1. Furthermore, in the footer it says Copyright 2014, Author. It should probably say something else or be removed altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5944) Python release docs say SNAPSHOT + Author is missing
Nicholas Chammas created SPARK-5944: --- Summary: Python release docs say SNAPSHOT + Author is missing Key: SPARK-5944 URL: https://issues.apache.org/jira/browse/SPARK-5944 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.1 Reporter: Nicholas Chammas Priority: Minor http://spark.apache.org/docs/latest/api/python/index.html As of Feb 2015, that link says PySpark 1.2-SNAPSHOT. It should probably say 1.2.1. Furthermore, in the footer it says Copyright 2014, Author. It should probably say something else or be removed altogether. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-765) Test suite should run Spark example programs
[ https://issues.apache.org/jira/browse/SPARK-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332438#comment-14332438 ] Nicholas Chammas commented on SPARK-765: Seems like a good idea. [~joshrosen] I assume this is still to be done, right? Test suite should run Spark example programs Key: SPARK-765 URL: https://issues.apache.org/jira/browse/SPARK-765 Project: Spark Issue Type: New Feature Components: Examples Reporter: Josh Rosen The Spark test suite should also run each of the Spark example programs (the PySpark suite should do the same). This should be done through a shell script or other mechanism to simulate the environment setup used by end users that run those scripts. This would prevent problems like SPARK-764 from making it into releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
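The harness SPARK-765 asks for — run each example the way an end user would and catch regressions like SPARK-764 — could be sketched like this. The glob pattern and the direct interpreter invocation are assumptions for illustration; Spark's actual examples are submitted through its own launcher scripts:

```python
import glob
import subprocess
import sys

def run_example_scripts(pattern, timeout=600):
    """Run each example script in a subprocess, simulating an end user's
    environment, and collect (script, exit_code) pairs for any failures."""
    failures = []
    for script in sorted(glob.glob(pattern)):
        proc = subprocess.run(
            [sys.executable, script],
            capture_output=True,
            timeout=timeout,
        )
        if proc.returncode != 0:
            failures.append((script, proc.returncode))
    return failures
```

A test suite would call this with the examples directory and fail if the returned list is non-empty, printing the captured output of each failing script.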