[jira] [Comment Edited] (SPARK-1532) provide option for more restrictive firewall rule in ec2/spark_ec2.py

2014-04-18 Thread Art Peel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973819#comment-13973819
 ] 

Art Peel edited comment on SPARK-1532 at 4/19/14 6:11 AM:
--

https://github.com/apache/spark/pull/445 

(subsequently closed and replaced by https://github.com/apache/spark/pull/453 ) 


was (Author: foundart):
https://github.com/apache/spark/pull/445

> provide option for more restrictive firewall rule in ec2/spark_ec2.py
> -
>
> Key: SPARK-1532
> URL: https://issues.apache.org/jira/browse/SPARK-1532
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 0.9.0
>Reporter: Art Peel
>Priority: Minor
>
> When ec2/spark_ec2.py sets up firewall rules for various ports, it uses an 
> extremely lenient hard-coded value for allowed IP addresses: '0.0.0.0/0'.
> It would be very useful for deployments to allow specifying the allowed IP 
> addresses as a command-line option to ec2/spark_ec2.py.  This new 
> configuration parameter should have as its default the current hard-coded 
> value, '0.0.0.0/0', so the functionality of ec2/spark_ec2.py will change only 
> for those users who specify the new option.
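A minimal sketch of what such an option could look like using Python's optparse (which spark_ec2.py of that era used for its CLI); the flag name --authorized-address and its wiring are illustrative assumptions, not the merged implementation:

```python
from optparse import OptionParser

# Hypothetical flag name; the default preserves today's permissive
# behavior, so only users who pass the option see a change.
parser = OptionParser()
parser.add_option(
    "--authorized-address", default="0.0.0.0/0",
    help="CIDR block allowed to reach the cluster's security-group ports "
         "(default keeps the current permissive behavior)")

# Restricting access to a private range:
opts, _ = parser.parse_args(["--authorized-address", "10.0.0.0/8"])
restricted = opts.authorized_address

# Omitting the flag keeps the old behavior:
default_opts, _ = parser.parse_args([])
```

The chosen CIDR would then be substituted wherever the script currently hard-codes '0.0.0.0/0' in its authorize-ingress calls.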



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1532) provide option for more restrictive firewall rule in ec2/spark_ec2.py

2014-04-18 Thread Art Peel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974747#comment-13974747
 ] 

Art Peel commented on SPARK-1532:
-

The original pull request failed on Travis-CI due to a timeout while compiling 
Scala code. That seems extremely unlikely to have resulted from my changes to 
ec2/spark_ec2.py, so I have generated a new pull request: 
https://github.com/apache/spark/pull/453






[jira] [Updated] (SPARK-1482) Potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset

2014-04-18 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1482:
-

Assignee: Shixiong Zhu

> Potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset
> -
>
> Key: SPARK-1482
> URL: https://issues.apache.org/jira/browse/SPARK-1482
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>  Labels: easyfix
> Fix For: 1.0.0
>
>
> "writer.close" should be put in the "finally" block to avoid potential 
> resource leaks.
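The fix described above is the standard close-in-finally pattern. A language-neutral sketch in Python (the Writer class below is a stand-in for the Hadoop record writer, not Spark's actual type):

```python
class Writer:
    """Stand-in for the Hadoop record writer; not Spark's actual class."""
    def __init__(self):
        self.closed = False

    def write(self, record):
        if record is None:
            raise ValueError("bad record")

    def close(self):
        self.closed = True


def write_partition(writer, records):
    # Putting close() in finally guarantees the handle is released
    # even when a write throws, which is the leak the issue describes.
    try:
        for record in records:
            writer.write(record)
    finally:
        writer.close()


w = Writer()
try:
    write_partition(w, [1, None])  # the second record raises
except ValueError:
    pass
# w.closed is True despite the failed write
```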





[jira] [Resolved] (SPARK-1482) Potential resource leaks in saveAsHadoopDataset and saveAsNewAPIHadoopDataset

2014-04-18 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1482.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

https://github.com/apache/spark/pull/400






[jira] [Updated] (SPARK-1538) SparkUI forgets about all persisted RDD's not directly associated with stages

2014-04-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1538:
-

Summary: SparkUI forgets about all persisted RDD's not directly associated 
with stages  (was: SparkUI forgets about all persisted RDD's not associated 
with stages)

> SparkUI forgets about all persisted RDD's not directly associated with stages
> -
>
> Key: SPARK-1538
> URL: https://issues.apache.org/jira/browse/SPARK-1538
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1
>Reporter: Andrew Or
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The following command creates two RDDs in one Stage:
> sc.parallelize(1 to 1000, 4).persist.map(_ + 1).count
> More specifically, parallelize creates one, and map creates another. If we 
> persist only the first one, it does not actually show up on the StorageTab of 
> the SparkUI.
> This is because StageInfo only keeps around information for the last RDD 
> associated with the stage, but forgets about all of its parents. The proposal 
> here is to have StageInfo climb the RDD dependency ladder to keep a list of 
> all associated RDDInfos.
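The proposed traversal amounts to a graph walk from the stage's final RDD over its parent dependencies. A toy Python sketch of the idea (class and field names are illustrative, not Spark's actual StageInfo/RDDInfo API):

```python
class RDDInfo:
    """Toy stand-in for Spark's RDDInfo; fields are illustrative."""
    def __init__(self, rdd_id, name, parents=()):
        self.id, self.name, self.parents = rdd_id, name, list(parents)


def all_rdd_infos(last_rdd):
    """Climb the dependency ladder from the stage's last RDD,
    collecting every ancestor instead of only the last RDD."""
    seen, stack, out = set(), [last_rdd], []
    while stack:
        rdd = stack.pop()
        if rdd.id in seen:
            continue
        seen.add(rdd.id)
        out.append(rdd)
        stack.extend(rdd.parents)
    return out


# Mirrors the example: parallelize creates one RDD, map creates another.
base = RDDInfo(0, "ParallelCollectionRDD")         # the persisted one
mapped = RDDInfo(1, "MappedRDD", parents=[base])   # last RDD in the stage
```

Walking from `mapped` now surfaces `base` as well, so a persisted parent would no longer vanish from the StorageTab.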





[jira] [Updated] (SPARK-1538) SparkUI forgets about all persisted RDD's not directly associated with the Stage

2014-04-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1538:
-

Summary: SparkUI forgets about all persisted RDD's not directly associated 
with the Stage  (was: SparkUI forgets about all persisted RDD's not directly 
associated with stages)






[jira] [Created] (SPARK-1538) SparkUI forgets about all persisted RDD's not associated with stages

2014-04-18 Thread Andrew Or (JIRA)
Andrew Or created SPARK-1538:


 Summary: SparkUI forgets about all persisted RDD's not associated 
with stages
 Key: SPARK-1538
 URL: https://issues.apache.org/jira/browse/SPARK-1538
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.1
Reporter: Andrew Or
Priority: Blocker
 Fix For: 1.0.0


The following command creates two RDDs in one Stage:

sc.parallelize(1 to 1000, 4).persist.map(_ + 1).count

More specifically, parallelize creates one, and map creates another. If we 
persist only the first one, it does not actually show up on the StorageTab of 
the SparkUI.

This is because StageInfo only keeps around information for the last RDD 
associated with the stage, but forgets about all of its parents. The proposal 
here is to have StageInfo climb the RDD dependency ladder to keep a list of all 
associated RDDInfos.





[jira] [Created] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2014-04-18 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-1537:
-

 Summary: Add integration with Yarn's Application Timeline Server
 Key: SPARK-1537
 URL: https://issues.apache.org/jira/browse/SPARK-1537
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Reporter: Marcelo Vanzin


It would be nice to have Spark integrate with Yarn's Application Timeline 
Server (see YARN-321, YARN-1530). This would allow users running Spark on Yarn 
to have a single place to go for all their history needs, and avoid having to 
manage a separate service (Spark's built-in server).

At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
although there is still some ongoing work. But the basics are there, and I 
wouldn't expect them to change (much) at this point.





[jira] [Updated] (SPARK-1536) Add multiclass classification support to MLlib

2014-04-18 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1536:
---

Description: 
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
- Choosing a good strategy for multiclass classification among multiple options:
  -- add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  -- error-correcting output codes
  -- one-vs-all
- Code implementation
- Unit tests
- Functional tests
- Performance tests
- Documentation


  was:
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
- Finding the best strategy for multiclass classification among multiple 
options:
  -- add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  -- error-correcting output codes
  -- one-vs-all
- Code implementation
- Unit tests
- Functional tests
- Performance tests
- Documentation



> Add multiclass classification support to MLlib
> --
>
> Key: SPARK-1536
> URL: https://issues.apache.org/jira/browse/SPARK-1536
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 0.9.0
>Reporter: Manish Amde
>
> The current decision tree implementation in MLlib only supports binary 
> classification. This task involves adding multiclass classification support 
> to the decision tree implementation.
> The task involves:
> - Choosing a good strategy for multiclass classification among multiple 
> options:
>   -- add multiclass support to impurity, but it won't work well with 
> categorical features since the centroid-based ordering assumptions won't hold 
> true
>   -- error-correcting output codes
>   -- one-vs-all
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation
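Of the strategies listed above, one-vs-all is the easiest to sketch: train one binary model per class, then predict the class whose model scores highest. A toy Python illustration of the reduction with a made-up binary learner (nothing here is MLlib's API):

```python
def train_one_vs_all(data, classes, train_binary):
    """One-vs-all reduction: fit one binary model per class.
    `train_binary` is any binary learner returning a scoring function;
    this sketches the strategy, not MLlib's decision-tree interface."""
    models = {}
    for c in classes:
        # Relabel: the current class is positive, everything else negative.
        relabeled = [(x, 1 if y == c else 0) for x, y in data]
        models[c] = train_binary(relabeled)
    return models


def predict(models, x):
    # Pick the class whose binary model scores x highest.
    return max(models, key=lambda c: models[c](x))


def toy_learner(rows):
    """Made-up binary learner: score = count of positives equal to x."""
    pos = [x for x, y in rows if y == 1]
    return lambda x: sum(1 for p in pos if p == x)


data = [(0, "a"), (0, "a"), (1, "b"), (2, "c")]
models = train_one_vs_all(data, {"a", "b", "c"}, toy_learner)
```

The same scaffolding would work with any binary impurity-based tree underneath, which is what makes this option attractive relative to modifying the impurity measures directly.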





[jira] [Updated] (SPARK-1536) Add multiclass classification support to MLlib

2014-04-18 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1536:
---

Description: 
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
+ Finding the best strategy for multiclass classification among multiple 
options:
  - add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  - error-correcting output codes
  - one-vs-all
+ Code implementation
+ Unit tests
+ Functional tests
+ Performance tests
+ Documentation


  was:
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.








[jira] [Updated] (SPARK-1536) Add multiclass classification support to MLlib

2014-04-18 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1536:
---

Description: 
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
- Finding the best strategy for multiclass classification among multiple 
options:
  -- add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  -- error-correcting output codes
  -- one-vs-all
- Code implementation
- Unit tests
- Functional tests
- Performance tests
- Documentation


  was:
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
- Finding the best strategy for multiclass classification among multiple 
options:
  -- add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  - error-correcting output codes
  - one-vs-all
+ Code implementation
+ Unit tests
+ Functional tests
+ Performance tests
+ Documentation








[jira] [Updated] (SPARK-1536) Add multiclass classification support to MLlib

2014-04-18 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1536:
---

Description: 
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
- Finding the best strategy for multiclass classification among multiple 
options:
  - add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  - error-correcting output codes
  - one-vs-all
+ Code implementation
+ Unit tests
+ Functional tests
+ Performance tests
+ Documentation


  was:
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
+ Finding the best strategy for multiclass classification among multiple 
options:
  - add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  - error-correcting output codes
  - one-vs-all
+ Code implementation
+ Unit tests
+ Functional tests
+ Performance tests
+ Documentation








[jira] [Updated] (SPARK-1536) Add multiclass classification support to MLlib

2014-04-18 Thread Manish Amde (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Amde updated SPARK-1536:
---

Description: 
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
- Finding the best strategy for multiclass classification among multiple 
options:
  -- add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  - error-correcting output codes
  - one-vs-all
+ Code implementation
+ Unit tests
+ Functional tests
+ Performance tests
+ Documentation


  was:
The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.

The task involves:
- Finding the best strategy for multiclass classification among multiple 
options:
  - add multiclass support to impurity, but it won't work well with 
categorical features since the centroid-based ordering assumptions won't hold 
true
  - error-correcting output codes
  - one-vs-all
+ Code implementation
+ Unit tests
+ Functional tests
+ Performance tests
+ Documentation








[jira] [Created] (SPARK-1536) Add multiclass classification support to MLlib

2014-04-18 Thread Manish Amde (JIRA)
Manish Amde created SPARK-1536:
--

 Summary: Add multiclass classification support to MLlib
 Key: SPARK-1536
 URL: https://issues.apache.org/jira/browse/SPARK-1536
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Manish Amde


The current decision tree implementation in MLlib only supports binary 
classification. This task involves adding multiclass classification support to 
the decision tree implementation.






[jira] [Commented] (SPARK-1229) train on array (in addition to RDD)

2014-04-18 Thread Aliaksei Litouka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974521#comment-13974521
 ] 

Aliaksei Litouka commented on SPARK-1229:
-

May I start working on this issue? Please assign it to me.

> train on array (in addition to RDD)
> ---
>
> Key: SPARK-1229
> URL: https://issues.apache.org/jira/browse/SPARK-1229
> Project: Spark
>  Issue Type: Story
>  Components: MLlib
>Reporter: Arshak Navruzyan
>
> since the predict method accepts either an RDD or an Array, train should do 
> the same for consistency (particularly since RDD.takeSample() returns an Array).
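The requested overload can be sketched by accepting either input type. The duck-typed RDD check and the trivial "model" below are illustrative only, not MLlib's actual signature:

```python
def train(data):
    """Accept a plain list (Array) as well as an RDD-like object."""
    if isinstance(data, list):
        samples = data              # Array case: use it directly
    else:
        samples = data.collect()    # RDD case: materialize on the driver
    # Stand-in for real model fitting: return the mean of the samples.
    return sum(samples) / len(samples)


class FakeRDD:
    """Minimal RDD stand-in, just enough for the sketch."""
    def __init__(self, xs):
        self.xs = xs

    def collect(self):
        return self.xs
```

Both call styles then produce the same model, which is the consistency the issue asks for.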





[jira] [Resolved] (SPARK-1184) Update the distribution tar.gz to include spark-assembly jar

2014-04-18 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-1184.


Resolution: Fixed

> Update the distribution tar.gz to include spark-assembly jar
> 
>
> Key: SPARK-1184
> URL: https://issues.apache.org/jira/browse/SPARK-1184
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 0.9.0
>
>
> This JIRA tracks 3 things:
> 1. There seems to be something going on in our assembly generation logic 
> because of which there are two assembly jars.
> Something like:
> {code}spark-assembly_2.10-1.0.0-SNAPSHOT.jar{code}
> and 
> {code}spark-assembly_2.10-1.0.0-SNAPSHOT-hadoop2.0.5-alpha.jar{code}
> The former is pretty bogus and doesn't contain any class files and should be 
> gotten rid of. The latter contains all the good stuff: it is essentially the 
> uber jar generated by the maven-shade-plugin.
> 2. The current bigtop-dist profile that builds the maven assembly (a .tar.gz 
> file) using the maven-assembly-plugin includes the bogus jar and not the 
> legit spark-assembly jar. We should get rid of the first one from this 
> assembly (which would happen when we fix #1) and put the legit uber jar in it.
> 3. Also, the bigtop-dist profile is meant to exclude the hadoop related jars 
> from the distribution. It does a good job of doing so for org.apache.hadoop 
> jars but misses the avro and zookeeper jars that are also provided by hadoop 
> land.





[jira] [Commented] (SPARK-1184) Update the distribution tar.gz to include spark-assembly jar

2014-04-18 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974475#comment-13974475
 ] 

Mark Grover commented on SPARK-1184:


Committed quite a while ago:
https://github.com/apache/spark/commit/cda381f88cc03340fdf7b2d681699babbae2a56e

Resolving






[jira] [Reopened] (SPARK-1459) EventLoggingListener does not work with "file://" target dir

2014-04-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-1459:
---


Sorry, got confused. This PR is still pending.

> EventLoggingListener does not work with "file://" target dir
> 
>
> Key: SPARK-1459
> URL: https://issues.apache.org/jira/browse/SPARK-1459
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>
> Bug is simple; FileLogger tries to pass a URL to FileOutputStream's 
> constructor, and that fails. I'll upload a PR soon.
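The failure mode is generic: stream constructors want a filesystem path, not a URI. A Python sketch of the scheme-stripping fix (the helper name is invented; the actual PR may handle this differently):

```python
from urllib.parse import urlparse


def to_local_path(target):
    """Convert a file:// URI into a plain filesystem path.
    FileOutputStream-style APIs reject URLs, which is the bug above;
    non-URI inputs pass through unchanged."""
    parsed = urlparse(target)
    if parsed.scheme == "file":
        return parsed.path
    return target
```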





[jira] [Resolved] (SPARK-1459) EventLoggingListener does not work with "file://" target dir

2014-04-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-1459.
---

Resolution: Fixed

This was commit 69047506. (If someone with permissions could set me as the 
assignee that would be great.)






[jira] [Created] (SPARK-1535) jblas's DoubleMatrix(double[]) ctor creates garbage; avoid

2014-04-18 Thread Tor Myklebust (JIRA)
Tor Myklebust created SPARK-1535:


 Summary: jblas's DoubleMatrix(double[]) ctor creates garbage; avoid
 Key: SPARK-1535
 URL: https://issues.apache.org/jira/browse/SPARK-1535
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Tor Myklebust
Priority: Trivial


The DoubleMatrix constructor that wraps a double[] and presents it as a row 
vector in jblas-1.2.3 allocates a new double[] and then immediately discards it. It is 
straightforward to replace uses of this constructor with the (int, int, 
double...) constructor; perhaps this should be done until jblas-1.2.4 is 
released.





[jira] [Resolved] (SPARK-1523) improve the readability of code in AkkaUtil

2014-04-18 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu resolved SPARK-1523.


   Resolution: Fixed
Fix Version/s: 1.1.0

> improve the readability of code in AkkaUtil 
> 
>
> Key: SPARK-1523
> URL: https://issues.apache.org/jira/browse/SPARK-1523
> Project: Spark
>  Issue Type: Improvement
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>Priority: Trivial
> Fix For: 1.1.0
>
>
> Actually it is separated from https://github.com/apache/spark/pull/85 as 
> suggested by Reynold
> compare 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L122
>  and 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala#L117
> the first one uses get and then toLong, while the second one uses getLong; 
> better to make them consistent.
> A very small fix.





[jira] [Resolved] (SPARK-1483) Rename minSplits to minPartitions in public APIs

2014-04-18 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu resolved SPARK-1483.


Resolution: Fixed

> Rename minSplits to minPartitions in public APIs
> 
>
> Key: SPARK-1483
> URL: https://issues.apache.org/jira/browse/SPARK-1483
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Nan Zhu
>Priority: Critical
> Fix For: 1.0.0
>
>
> The parameter name is part of the public API in Scala and Python, since you 
> can pass named parameters to a method, so we should rename it to this more 
> descriptive term. Everywhere else we refer to "splits" as partitions.





[jira] [Created] (SPARK-1534) spark-submit for yarn prints warnings even though calling as expected

2014-04-18 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-1534:


 Summary: spark-submit for yarn prints warnings even though calling 
as expected 
 Key: SPARK-1534
 URL: https://issues.apache.org/jira/browse/SPARK-1534
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Thomas Graves


I am calling spark-submit to submit an application to Spark on YARN (cluster 
mode), and it is still printing warnings:

$ ./bin/spark-submit  
examples/target/scala-2.10/spark-examples_2.10-assembly-1.0.0-SNAPSHOT.jar  
--master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi 
--arg yarn-cluster --properties-file ./spark-conf.properties 
WARNING: This client is deprecated and will be removed in a future version of 
Spark.
Use ./bin/spark-submit with "--master yarn"
--args is deprecated. Use --arg instead.


The "--args is deprecated" warning appears because SparkSubmit itself needs to 
be updated to use --arg. 

Similarly, I think the Client.scala class for YARN needs to have the "Use 
./bin/spark-submit with '--master yarn'" warning removed, since SparkSubmit also 
calls it directly.

I think that last warning was intended for users who invoke spark-class directly. 





[jira] [Updated] (SPARK-1485) Implement AllReduce

2014-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1485:
-

Affects Version/s: (was: 1.0.0)

> Implement AllReduce
> ---
>
> Key: SPARK-1485
> URL: https://issues.apache.org/jira/browse/SPARK-1485
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> The current implementations of machine learning algorithms rely on the driver 
> for some computation and data broadcasting. This will create a bottleneck at 
> the driver for both computation and communication, especially in multi-model 
> training. An efficient implementation of AllReduce (or AllAggregate) can help 
> free the driver:
> allReduce(RDD[T], (T, T) => T): RDD[T]
> This JIRA is created for discussing how to implement AllReduce efficiently 
> and possible alternatives.
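For intuition, the driver-free pattern can be mimicked with a butterfly exchange: after log2(n) rounds of pairwise combines, every partition holds the full result. A toy Python sketch over a plain list standing in for partitions (it assumes a power-of-two partition count and illustrates the communication pattern, not the proposed RDD implementation):

```python
def all_reduce(values, combine):
    """Butterfly all-reduce over a list standing in for partitions.
    After log2(n) pairwise-exchange rounds, every slot holds the
    combined result, with no single 'driver' slot doing all the work.
    Assumes len(values) is a power of two."""
    values = list(values)
    n = len(values)
    step = 1
    while step < n:
        # In each round, slot i combines with its partner at i XOR step.
        values = [combine(values[i], values[i ^ step]) for i in range(n)]
        step *= 2
    return values
```

With `combine` required to be associative and commutative (as for RDD-style reductions), each round doubles the span of data every slot has seen.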





[jira] [Updated] (SPARK-1533) The (kill) button in the web UI is visible to everyone.

2014-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1533:
-

Priority: Blocker  (was: Major)

> The (kill) button in the web UI is visible to everyone.
> ---
>
> Key: SPARK-1533
> URL: https://issues.apache.org/jira/browse/SPARK-1533
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> We can kill jobs from the web UI now, which is great. But there is no 
> authentication in standalone mode, e.g., on clusters created by spark-ec2, 
> so anyone can visit a standalone server and kill jobs.





[jira] [Created] (SPARK-1533) The (kill) button in the web UI is visible to everyone.

2014-04-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1533:


 Summary: The (kill) button in the web UI is visible to everyone.
 Key: SPARK-1533
 URL: https://issues.apache.org/jira/browse/SPARK-1533
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


We can kill jobs from the web UI now, which is great. But there is no 
authentication in standalone mode, e.g., on clusters created by spark-ec2, 
so anyone can visit a standalone server and kill jobs.


