[jira] [Resolved] (SPARK-2805) Update akka to version 2.3.4

2014-10-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2805.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Anand Avati

 Update akka to version 2.3.4
 

 Key: SPARK-2805
 URL: https://issues.apache.org/jira/browse/SPARK-2805
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati
Assignee: Anand Avati
 Fix For: 1.2.0


 akka-2.3 is the lowest Akka version available for Scala 2.11.
 akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. In order
 to reconcile the conflicting dependencies, we need to release an
 akka-2.3.x-shaded-protobuf artifact that bundles protobuf 2.5 internally.
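
As a rough sketch (not the final published coordinates), a downstream sbt build could then pull in the shaded artifact along the lines of:

{code}
// Hypothetical coordinates for an Akka build that relocates ("shades") its
// protobuf 2.5 classes so they cannot clash with Hadoop-1's protobuf 2.4.1.
libraryDependencies ++= Seq(
  "org.spark-project.akka" %% "akka-remote" % "2.3.4-shaded-protobuf",
  // Hadoop-1 keeps its own, unshaded protobuf 2.4.1 on the classpath.
  "org.apache.hadoop" % "hadoop-client" % "1.2.1"
)
{code}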



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2805) Update akka to version 2.3.4

2014-10-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2805:
---
Summary: Update akka to version 2.3.4  (was: Update akka to version 2.3)

 Update akka to version 2.3.4
 

 Key: SPARK-2805
 URL: https://issues.apache.org/jira/browse/SPARK-2805
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati
 Fix For: 1.2.0


 akka-2.3 is the lowest Akka version available for Scala 2.11.
 akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. In order
 to reconcile the conflicting dependencies, we need to release an
 akka-2.3.x-shaded-protobuf artifact that bundles protobuf 2.5 internally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2805) Update akka to version 2.3

2014-10-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2805:
---
Summary: Update akka to version 2.3  (was: update akka to version 2.3)

 Update akka to version 2.3
 --

 Key: SPARK-2805
 URL: https://issues.apache.org/jira/browse/SPARK-2805
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Anand Avati
 Fix For: 1.2.0


 akka-2.3 is the lowest Akka version available for Scala 2.11.
 akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. In order
 to reconcile the conflicting dependencies, we need to release an
 akka-2.3.x-shaded-protobuf artifact that bundles protobuf 2.5 internally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3872) Rewrite the test for ActorInputStream.

2014-10-09 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-3872:
--

 Summary: Rewrite the test for ActorInputStream. 
 Key: SPARK-3872
 URL: https://issues.apache.org/jira/browse/SPARK-3872
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Prashant Sharma






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3872) Rewrite the test for ActorInputStream.

2014-10-09 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma reassigned SPARK-3872:
--

Assignee: Prashant Sharma

 Rewrite the test for ActorInputStream. 
 ---

 Key: SPARK-3872
 URL: https://issues.apache.org/jira/browse/SPARK-3872
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Reporter: Prashant Sharma
Assignee: Prashant Sharma





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3873) Scala style: check import ordering

2014-10-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3873:
--

 Summary: Scala style: check import ordering
 Key: SPARK-3873
 URL: https://issues.apache.org/jira/browse/SPARK-3873
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3844) Truncate appName in WebUI if it is too long

2014-10-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3844.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s: 1.1.1, 1.2.0  (was: 1.2.0)

 Truncate appName in WebUI if it is too long
 ---

 Key: SPARK-3844
 URL: https://issues.apache.org/jira/browse/SPARK-3844
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Trivial
 Fix For: 1.1.1, 1.2.0

 Attachments: long-title.png


 If `appName` is too long, it may move off the navbar. We can put the full
 name inside the `title` attribute while truncating the displayed name.
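
As an illustration only (not the actual Spark UI code), the navbar link could be rendered roughly like this, with a hypothetical length cutoff:

{code}
// Sketch only: show a truncated name in the navbar, but keep the full name
// in the title attribute so it is still visible as a browser tooltip.
def appNameHeader(appName: String): scala.xml.Elem = {
  val maxLen = 36  // hypothetical cutoff, not a real Spark setting
  val shown =
    if (appName.length > maxLen) appName.take(maxLen - 3) + "..." else appName
  <a href="/" class="brand" title={appName}>{shown}</a>
}
{code}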



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3873) Scala style: check import ordering

2014-10-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164841#comment-14164841
 ] 

Patrick Wendell commented on SPARK-3873:


If we can do this it would be super, duper awesome.

 Scala style: check import ordering
 --

 Key: SPARK-3873
 URL: https://issues.apache.org/jira/browse/SPARK-3873
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3834) Backticks not correctly handled in subquery aliases

2014-10-09 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164857#comment-14164857
 ] 

Ravindra Pesala commented on SPARK-3834:


Ok [~marmbrus] , I will work on it.

 Backticks not correctly handled in subquery aliases
 ---

 Key: SPARK-3834
 URL: https://issues.apache.org/jira/browse/SPARK-3834
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Ravindra Pesala
Priority: Blocker

 [~ravi.pesala]  assigning to you since you fixed the last problem here.  Let 
 me know if you don't have time to work on this or if you have any questions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3874) Provide stable TaskContext API

2014-10-09 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3874:
--

 Summary: Provide stable TaskContext API
 Key: SPARK-3874
 URL: https://issues.apache.org/jira/browse/SPARK-3874
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma


We made some improvements in SPARK-3543 but for Spark 1.2 we should convert 
TaskContext into a fully stable API. To do this I’d suggest the following 
changes - note that some of this reverses parts of SPARK-3543. The goal is to 
provide a class that users can’t easily construct and exposes only the public 
functionality.

1. Separate TaskContext into a public abstract class (TaskContext) and a 
private implementation called TaskContextImpl. The former should be a Java 
abstract class - the latter should be a private[spark] Scala class to reduce 
visibility (or maybe we can keep it as Java and tell people not to use it?).

2. TaskContext abstract class will have (NOTE: this changes getXX() to XX() 
intentionally)
public isCompleted()
public isInterrupted()
public addTaskCompletionListener(...)
public addTaskCompletionCallback(...) (deprecated)
public stageId()
public partitionId()
public attemptId()
public isRunningLocally()
STATIC
public get() 
set() and unset() at default visibility

3. A new private[spark] static object TaskContextHelper in the same package as 
TaskContext will exist to expose set() and unset() from within Spark using 
forwarder methods that just call TaskContext.set(). If someone within Spark 
wants to set this they call TaskContextHelper.set() and it forwards it.

4. TaskContextImpl will be used whenever we construct a TaskContext internally.
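
A rough Scala sketch of the proposed shape (per point 1 the public part would actually be a Java abstract class, and listener registration is omitted for brevity; method names follow the list above, everything else is illustrative only):

{code}
package org.apache.spark

// Sketch only. The public, stable surface users compile against:
abstract class TaskContext {
  def isCompleted(): Boolean
  def isInterrupted(): Boolean
  def stageId(): Int
  def partitionId(): Int
  def attemptId(): Long
  def isRunningLocally(): Boolean
}

object TaskContext {
  private val local = new ThreadLocal[TaskContext]
  def get(): TaskContext = local.get()
  // restricted visibility: not part of the public API
  private[spark] def set(tc: TaskContext): Unit = local.set(tc)
  private[spark] def unset(): Unit = local.remove()
}

// Forwarder so other Spark code can reach the restricted set()/unset().
private[spark] object TaskContextHelper {
  def set(tc: TaskContext): Unit = TaskContext.set(tc)
  def unset(): Unit = TaskContext.unset()
}

// Internal implementation, constructed only inside Spark.
private[spark] class TaskContextImpl(sid: Int, pid: Int, aid: Long) extends TaskContext {
  @volatile private var completed = false
  private[spark] def markTaskCompleted(): Unit = { completed = true }
  override def isCompleted(): Boolean = completed
  override def isInterrupted(): Boolean = false
  override def stageId(): Int = sid
  override def partitionId(): Int = pid
  override def attemptId(): Long = aid
  override def isRunningLocally(): Boolean = false
}
{code}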



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3875) Add TEMP DIRECTORY configuration

2014-10-09 Thread Patrick Liu (JIRA)
Patrick Liu created SPARK-3875:
--

 Summary: Add TEMP DIRECTORY configuration
 Key: SPARK-3875
 URL: https://issues.apache.org/jira/browse/SPARK-3875
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Patrick Liu


Currently, Spark uses java.io.tmpdir to locate the /tmp/ directory.

The /tmp/ directory is then used to:
1. Set up the HTTP file server
2. Hold the broadcast directory
3. Let executors fetch dependency files or jars

The size of the /tmp/ directory will keep growing, and the free space of the
system disk will shrink.

I think we could add a configuration spark.tmp.dir in conf/spark-env.sh or
conf/spark-defaults.conf to set this particular directory, e.g. pointing it at a
data disk.
If spark.tmp.dir is not set, use the default java.io.tmpdir.
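
A small sketch of the proposed lookup order; spark.tmp.dir is the suggested, not yet existing, property name:

{code}
import org.apache.spark.SparkConf

// Sketch: resolve the scratch directory from the proposed spark.tmp.dir,
// falling back to the JVM's java.io.tmpdir when it is not set.
def resolveTmpDir(conf: SparkConf): String =
  conf.getOption("spark.tmp.dir")                // e.g. set in spark-defaults.conf
    .orElse(sys.props.get("java.io.tmpdir"))     // current behaviour
    .getOrElse("/tmp")
{code}

In conf/spark-defaults.conf this might look like: spark.tmp.dir /data1/spark-tmp (the path is just an example).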



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3874) Provide stable TaskContext API

2014-10-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164874#comment-14164874
 ] 

Reynold Xin commented on SPARK-3874:


The proposal LGTM.

 Provide stable TaskContext API
 --

 Key: SPARK-3874
 URL: https://issues.apache.org/jira/browse/SPARK-3874
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 We made some improvements in SPARK-3543 but for Spark 1.2 we should convert 
 TaskContext into a fully stable API. To do this I’d suggest the following 
 changes - note that some of this reverses parts of SPARK-3543. The goal is to 
 provide a class that users can’t easily construct and exposes only the public 
 functionality.
 1. Separate TaskContext into a public abstract class (TaskContext) and a 
 private implementation called TaskContextImpl. The former should be a Java 
 abstract class - the latter should be a private[spark] Scala class to reduce 
 visibility (or maybe we can keep it as Java and tell people not to use it?).
 2. TaskContext abstract class will have (NOTE: this changes getXX() to XX() 
 intentionally)
 public isCompleted()
 public isInterrupted()
 public addTaskCompletionListener(...)
 public addTaskCompletionCallback(...) (deprecated)
 public stageId()
 public partitionId()
 public attemptId()
 public isRunningLocally()
 STATIC
 public get() 
 set() and unset() at default visibility
 3. A new private[spark] static object TaskContextHelper in the same package 
 as TaskContext will exist to expose set() and unset() from within Spark using 
 forwarder methods that just call TaskContext.set(). If someone within Spark 
 wants to set this they call TaskContextHelper.set() and it forwards it.
 4. TaskContextImpl will be used whenever we construct a TaskContext 
 internally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3875) Add TEMP DIRECTORY configuration

2014-10-09 Thread Patrick Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164876#comment-14164876
 ] 

Patrick Liu commented on SPARK-3875:


https://github.com/apache/spark/pull/2729

 Add TEMP DIRECTORY configuration
 

 Key: SPARK-3875
 URL: https://issues.apache.org/jira/browse/SPARK-3875
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Patrick Liu

 Currently, Spark uses java.io.tmpdir to locate the /tmp/ directory.
 The /tmp/ directory is then used to:
 1. Set up the HTTP file server
 2. Hold the broadcast directory
 3. Let executors fetch dependency files or jars
 The size of the /tmp/ directory will keep growing, and the free space of the
 system disk will shrink.
 I think we could add a configuration spark.tmp.dir in conf/spark-env.sh or
 conf/spark-defaults.conf to set this particular directory, e.g. pointing it at a
 data disk.
 If spark.tmp.dir is not set, use the default java.io.tmpdir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3875) Add TEMP DIRECTORY configuration

2014-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164880#comment-14164880
 ] 

Apache Spark commented on SPARK-3875:
-

User 'kelepi' has created a pull request for this issue:
https://github.com/apache/spark/pull/2729

 Add TEMP DIRECTORY configuration
 

 Key: SPARK-3875
 URL: https://issues.apache.org/jira/browse/SPARK-3875
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Patrick Liu

 Currently, Spark uses java.io.tmpdir to locate the /tmp/ directory.
 The /tmp/ directory is then used to:
 1. Set up the HTTP file server
 2. Hold the broadcast directory
 3. Let executors fetch dependency files or jars
 The size of the /tmp/ directory will keep growing, and the free space of the
 system disk will shrink.
 I think we could add a configuration spark.tmp.dir in conf/spark-env.sh or
 conf/spark-defaults.conf to set this particular directory, e.g. pointing it at a
 data disk.
 If spark.tmp.dir is not set, use the default java.io.tmpdir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3875) Add TEMP DIRECTORY configuration

2014-10-09 Thread Patrick Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Liu updated SPARK-3875:
---
Comment: was deleted

(was: https://github.com/apache/spark/pull/2729)

 Add TEMP DIRECTORY configuration
 

 Key: SPARK-3875
 URL: https://issues.apache.org/jira/browse/SPARK-3875
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Patrick Liu

 Currently, Spark uses java.io.tmpdir to locate the /tmp/ directory.
 The /tmp/ directory is then used to:
 1. Set up the HTTP file server
 2. Hold the broadcast directory
 3. Let executors fetch dependency files or jars
 The size of the /tmp/ directory will keep growing, and the free space of the
 system disk will shrink.
 I think we could add a configuration spark.tmp.dir in conf/spark-env.sh or
 conf/spark-defaults.conf to set this particular directory, e.g. pointing it at a
 data disk.
 If spark.tmp.dir is not set, use the default java.io.tmpdir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext

2014-10-09 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164882#comment-14164882
 ] 

Jianshi Huang commented on SPARK-3845:
--

Looks like it's fixed in the latest 1.2.0 snapshot.

In 1.1.0, sqlContext.getAllConfs returns an empty map.

Jianshi

 SQLContext(...) should inherit configurations from SparkContext
 ---

 Key: SPARK-3845
 URL: https://issues.apache.org/jira/browse/SPARK-3845
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jianshi Huang

 It's very confusing that Spark configurations (e.g. spark.serializer,
 spark.speculation, etc.) can be set in the spark-defaults.conf file, while
 Spark SQL configurations (e.g. spark.sql.inMemoryColumnarStorage.compressed,
 spark.sql.codegen, etc.) have to be set either via sqlContext.setConf or
 sql("SET ...").
 When I do:
   val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
 I would expect sqlContext to recognize all the SQL configurations that come with
 sparkContext.
 Jianshi
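
To illustrate the expectation (a sketch, not a test from the code base), assuming the SQL setting is supplied through SparkConf:

{code}
// Sketch of the expected behaviour: SQL settings passed via SparkConf /
// spark-defaults.conf should be visible through the new SQLContext.
val conf = new org.apache.spark.SparkConf()
  .setAppName("conf-inheritance")
  .set("spark.sql.inMemoryColumnarStorage.compressed", "true")
val sc = new org.apache.spark.SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Expected after the fix: the spark.sql.* entries from the SparkContext
// show up here instead of an empty map (as reported for 1.1.0).
println(sqlContext.getAllConfs)
{code}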



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3158) Avoid 1 extra aggregation for DecisionTree training

2014-10-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3158.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2708
[https://github.com/apache/spark/pull/2708]

 Avoid 1 extra aggregation for DecisionTree training
 ---

 Key: SPARK-3158
 URL: https://issues.apache.org/jira/browse/SPARK-3158
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Qiping Li
 Fix For: 1.2.0


 Improvement: computation
 Currently, the implementation does one unnecessary aggregation step.  The 
 aggregation step for level L (to choose splits) gives enough information to 
 set the predictions of any leaf nodes at level L+1.  We can use that info and 
 skip the aggregation step for the last level of the tree (which only has leaf 
 nodes).
 This update could be done by:
 * allocating a root node before the loop in the main train() method
 * allocating nodes for level L+1 while choosing splits for level L
 * caching stats in these newly allocated nodes, so that we can calculate 
 predictions if we know they will be leaves
 * DecisionTree.findBestSplits can just return doneTraining
 This will let us cache impurity and avoid re-calculating it in 
 calculateGainForSplit.
 Some of the above notes were copied from the discussion in
 [https://github.com/apache/spark/pull/2341]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3876) Doing a RDD map/reduce within a DStream map fails with a high enough input rate

2014-10-09 Thread Andrei Filip (JIRA)
Andrei Filip created SPARK-3876:
---

 Summary: Doing a RDD map/reduce within a DStream map fails with a 
high enough input rate
 Key: SPARK-3876
 URL: https://issues.apache.org/jira/browse/SPARK-3876
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Andrei Filip


Having a custom receiver that generates random strings at custom rates:
JavaRandomSentenceReceiver

A class that does work on a received string:

class LengthGetter implements Serializable {
    public int getStrLength(String s) {
        return s.length();
    }
}

The following code:

List<LengthGetter> objList = Arrays.asList(new LengthGetter(), new LengthGetter(), new LengthGetter());

final JavaRDD<LengthGetter> objRdd = sc.parallelize(objList);

JavaInputDStream<String> sentences = jssc.receiverStream(new JavaRandomSentenceReceiver(frequency));

sentences.map(new Function<String, Integer>() {

    @Override
    public Integer call(final String input) throws Exception {
        Integer res = objRdd.map(new Function<LengthGetter, Integer>() {

            @Override
            public Integer call(LengthGetter lg) throws Exception {
                return lg.getStrLength(input);
            }
        }).reduce(new Function2<Integer, Integer, Integer>() {

            @Override
            public Integer call(Integer left, Integer right) throws Exception {
                return left + right;
            }
        });

        return res;
    }
}).print();


fails for high enough frequencies with the following stack trace:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 3.0:0 failed 1 times, most recent failure: Exception 
failure in TID 3 on host localhost: java.lang.NullPointerException
org.apache.spark.rdd.RDD.map(RDD.scala:270)
org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:72)
org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:29)



Other information that might be useful: my current batch duration is set to
1 second, and the frequencies for JavaRandomSentenceReceiver at which the
application fails are as low as 2 Hz (1 Hz, for example, works).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3830) Implement genetic algorithms in MLLib

2014-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164979#comment-14164979
 ] 

Apache Spark commented on SPARK-3830:
-

User 'epahomov' has created a pull request for this issue:
https://github.com/apache/spark/pull/2731

 Implement genetic algorithms in MLLib
 -

 Key: SPARK-3830
 URL: https://issues.apache.org/jira/browse/SPARK-3830
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Egor Pakhomov
Assignee: Egor Pakhomov
Priority: Minor

 Implement evolutionary computation algorithm in MLLib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`

2014-10-09 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164978#comment-14164978
 ] 

Kousuke Saruta commented on SPARK-3854:
---

[~joshrosen] I tried to write code to check for spaces before '{', as follows.

{code}
package org.apache.spark.scalastyle

import org.scalastyle.{PositionError, ScalariformChecker, ScalastyleError}
import scala.collection.mutable.{ListBuffer, Queue}
import scalariform.lexer.{Token, Tokens}
import scalariform.lexer.Tokens._
import scalariform.parser.CompilationUnit

class SparkSpaceBeforeLeftBraceChecker extends ScalariformChecker {
  val errorKey: String = "insert.a.single.space.before.left.brace"

  val rememberQueue: Queue[Token] = Queue[Token]()

  // The list of disallowed tokens before left brace without single space.
  val disallowedTokensBeforeLBrace = Seq(
    ARROW, ELSE, OP, RPAREN, TRY, MATCH, NEW, DO, FINALLY, PACKAGE, RETURN,
    THROW, YIELD, VARID
  )

  override def verify(ast: CompilationUnit): List[ScalastyleError] = {

    var list: ListBuffer[ScalastyleError] = new ListBuffer[ScalastyleError]

    for (token <- ast.tokens) {
      rememberToken(token)
      if (isLBrace(token) &&
          isTokenAfterSpecificTokens(token) &&
          !hasSingleWhiteSpaceBefore(token)) {
        list += new PositionError(token.offset)
      }
    }
    list.toList
  }

  private def rememberToken(x: Token) = {
    rememberQueue.enqueue(x)
    if (rememberQueue.size > 2) {
      rememberQueue.dequeue
    }
    x
  }

  private def isTokenAfterSpecificTokens(x: Token) = {
    val previousToken = rememberQueue.head
    disallowedTokensBeforeLBrace.contains(previousToken.tokenType)
  }

  private def isLBrace(x: Token) =
    x.tokenType == Tokens.LBRACE

  private def hasSingleWhiteSpaceBefore(x: Token) =
    x.associatedWhitespaceAndComments.whitespaces.size == 1
}
{code}

How does this look?

 Scala style: require spaces before `{`
 --

 Key: SPARK-3854
 URL: https://issues.apache.org/jira/browse/SPARK-3854
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Josh Rosen

 We should require spaces before opening curly braces.  This isn't in the 
 style guide, but it probably should be:
 {code}
 // Correct:
 if (true) {
   println("Wow!")
 }
 // Incorrect:
 if (true){
   println("Wow!")
 }
 {code}
 See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an 
 example in the wild.
 {{git grep "){"}} shows only a few occurrences of this style.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3877) The exit code of spark-submit is still 0 when a yarn application fails

2014-10-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-3877:
---

 Summary: The exit code of spark-submit is still 0 when a yarn 
application fails
 Key: SPARK-3877
 URL: https://issues.apache.org/jira/browse/SPARK-3877
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Shixiong Zhu
Priority: Minor


When a YARN application fails (yarn-cluster mode), the exit code of
spark-submit is still 0. This makes it hard to write automation scripts that
run Spark jobs on YARN, because the failure cannot be detected in those
scripts.
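
For illustration, a hedged sketch of the kind of automation that breaks; written in Scala, with placeholder paths and class names:

{code}
import scala.sys.process._

// Sketch: a driver script that relies on spark-submit's exit status.
// Because the status is 0 even when the YARN application fails,
// this check never triggers.
val exitCode = Seq(
  "./bin/spark-submit", "--master", "yarn-cluster",
  "--class", "com.example.MyJob", "my-job.jar"   // placeholder job
).!

if (exitCode != 0) {
  sys.error(s"Spark job failed with exit code $exitCode")
}
{code}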



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3507) Create RegressionLearner trait and make some current code implement it

2014-10-09 Thread Egor Pakhomov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egor Pakhomov closed SPARK-3507.

Resolution: Duplicate

duplicate of SPARK-1856

 Create RegressionLearner trait and make some current code implement it
 --

 Key: SPARK-3507
 URL: https://issues.apache.org/jira/browse/SPARK-3507
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Egor Pakhomov
Assignee: Egor Pakhomov
Priority: Minor
   Original Estimate: 168h
  Remaining Estimate: 168h

 Here at Yandex, while implementing gradient boosting in Spark and creating our
 ML tool for internal use, we found the following serious problems in MLlib:
 There is no Regression/Classification learner model abstraction. We were
 building abstract data processing pipelines, which should work with just some
 regression, with the exact algorithm specified outside this code. There is no
 abstraction which allows me to do that. (This is the main reason for all
 further problems.)
 There is no common practice in MLlib for testing algorithms: every model
 generates its own random test data. There are no easily extractable test cases
 applicable to other algorithms. There are no benchmarks for comparing
 algorithms. After implementing a new algorithm it is very hard to understand
 how it should be tested.
 Lack of serialization testing: MLlib algorithms don't contain tests which
 verify that a model works after serialization.
 During implementation of a new algorithm it is hard to understand what API you
 should create and which interface to implement.
 A start on solving all these problems must be made by creating a common
 interface for the typical algorithms/models: regression, classification,
 clustering, collaborative filtering.
 All the main tests should be written against these interfaces, so that when a
 new algorithm is implemented, all it needs to do is pass the already written
 tests. That lets us keep the quality manageable across the whole library.
 There should be a couple of benchmarks which give a new Spark user a feeling
 for which algorithm to use.
 The test set against these abstractions should contain a serialization test. In
 production there is usually no need for a model that cannot be stored.
 As the first step of this roadmap I'd like to create a trait RegressionLearner,
 add methods to current algorithms to implement this trait, and create some
 tests against it.
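
A rough sketch of what such a trait might look like; the names and signatures below are illustrative, not an agreed-upon API:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch of a learner/model split so pipelines can be written against the
// abstraction while the concrete algorithm is plugged in from the outside.
trait RegressionModel extends Serializable {
  def predict(features: Vector): Double
  def predict(features: RDD[Vector]): RDD[Double] = features.map(v => predict(v))
}

trait RegressionLearner extends Serializable {
  // Every regression algorithm exposes the same entry point, so common tests,
  // benchmarks and serialization checks can be written once.
  def train(data: RDD[LabeledPoint]): RegressionModel
}
{code}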



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when a yarn application fails

2014-10-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164993#comment-14164993
 ] 

Apache Spark commented on SPARK-3877:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/2732

 The exit code of spark-submit is still 0 when a yarn application fails
 ---

 Key: SPARK-3877
 URL: https://issues.apache.org/jira/browse/SPARK-3877
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Shixiong Zhu
Priority: Minor
  Labels: yarn

 When a YARN application fails (yarn-cluster mode), the exit code of
 spark-submit is still 0. This makes it hard to write automation scripts that
 run Spark jobs on YARN, because the failure cannot be detected in those
 scripts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-09 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-2429:
---
Attachment: The Result of Benchmarking a Hierarchical Clustering.pdf

Sorry for making some mistakes. I fixed them.

- Cluster Spec
- Typo fixes

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
 Attachments: The Result of Benchmarking a Hierarchical 
 Clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-09 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-2429:
---
Attachment: (was: The Result of Benchmarking a Hierarchical 
Clustering.pdf)

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
 Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3878) Benchmarks and common tests for mllib algorithm

2014-10-09 Thread Egor Pakhomov (JIRA)
Egor Pakhomov created SPARK-3878:


 Summary: Benchmarks and common tests for mllib algorithm
 Key: SPARK-3878
 URL: https://issues.apache.org/jira/browse/SPARK-3878
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Egor Pakhomov


There is no common practice in MLlib for testing algorithms: every model
generates its own random test data. There are no easily extractable test cases
applicable to other algorithms. There are no benchmarks for comparing algorithms.
After implementing a new algorithm it is very hard to understand how it should be
tested.
Lack of serialization testing: MLlib algorithms don't contain tests which verify
that a model works after serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when a yarn application fails

2014-10-09 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165124#comment-14165124
 ] 

Thomas Graves commented on SPARK-3877:
--

this looks like a dup of SPARK-2167.  Or actually perhaps a subset of that 
since I think you only handle the yarn mode.   Does this cover both client and 
cluster mode?



 The exit code of spark-submit is still 0 when a yarn application fails
 ---

 Key: SPARK-3877
 URL: https://issues.apache.org/jira/browse/SPARK-3877
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Shixiong Zhu
Priority: Minor
  Labels: yarn

 When a YARN application fails (yarn-cluster mode), the exit code of
 spark-submit is still 0. This makes it hard to write automation scripts that
 run Spark jobs on YARN, because the failure cannot be detected in those
 scripts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3850) Scala style: disallow trailing spaces

2014-10-09 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3850:

Summary: Scala style: disallow trailing spaces  (was: Scala style: Disallow 
trailing spaces)

 Scala style: disallow trailing spaces
 -

 Key: SPARK-3850
 URL: https://issues.apache.org/jira/browse/SPARK-3850
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Nicholas Chammas





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3879) spark-shell.cmd fails giving error !=x was unexpected at this time

2014-10-09 Thread Venkata Ramana G (JIRA)
Venkata Ramana G created SPARK-3879:
---

 Summary: spark-shell.cmd fails giving error !=x was unexpected at 
this time
 Key: SPARK-3879
 URL: https://issues.apache.org/jira/browse/SPARK-3879
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
 Environment: Windows
Reporter: Venkata Ramana G


spark-shell.cmd gives the error: !=x was unexpected at this time
This problem was introduced in SPARK-2058



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3879) spark-shell.cmd fails giving error !=x was unexpected at this time

2014-10-09 Thread Venkata Ramana G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165209#comment-14165209
 ] 

Venkata Ramana G commented on SPARK-3879:
-

I have fixed the same, about to submit PR.

 spark-shell.cmd fails giving error !=x was unexpected at this time
 

 Key: SPARK-3879
 URL: https://issues.apache.org/jira/browse/SPARK-3879
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
 Environment: Windows
Reporter: Venkata Ramana G

 spark-shell.cmd gives the error: !=x was unexpected at this time
 This problem was introduced in SPARK-2058



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3879) spark-shell.cmd fails giving error !=x was unexpected at this time

2014-10-09 Thread Venkata Ramana G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165225#comment-14165225
 ] 

Venkata Ramana G commented on SPARK-3879:
-

It was already fixed, under SPARK-3808. So can close this issue.

 spark-shell.cmd fails giving error !=x was unexpected at this time
 

 Key: SPARK-3879
 URL: https://issues.apache.org/jira/browse/SPARK-3879
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
 Environment: Windows
Reporter: Venkata Ramana G

 spark-shell.cmd gives the error: !=x was unexpected at this time
 This problem was introduced in SPARK-2058



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3879) spark-shell.cmd fails giving error !=x was unexpected at this time

2014-10-09 Thread Venkata Ramana G (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Ramana G closed SPARK-3879.
---
Resolution: Duplicate

 spark-shell.cmd fails giving error !=x was unexpected at this time
 

 Key: SPARK-3879
 URL: https://issues.apache.org/jira/browse/SPARK-3879
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
 Environment: Windows
Reporter: Venkata Ramana G

 spark-shell.cmd gives the error: !=x was unexpected at this time
 This problem was introduced in SPARK-2058



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3850) Scala style: disallow trailing spaces

2014-10-09 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3850:

Description: [Ted Yu on the dev 
list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
 suggested using {{WhitespaceEndOfLineChecker}} here: 
http://www.scalastyle.org/rules-0.1.0.html

 Scala style: disallow trailing spaces
 -

 Key: SPARK-3850
 URL: https://issues.apache.org/jira/browse/SPARK-3850
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Nicholas Chammas

 [Ted Yu on the dev 
 list|http://mail-archives.apache.org/mod_mbox/spark-dev/201410.mbox/%3ccalte62y7a6wybdufdcguwbf8wcpttvie+pao4pzor+_-nb2...@mail.gmail.com%3E]
  suggested using {{WhitespaceEndOfLineChecker}} here: 
 http://www.scalastyle.org/rules-0.1.0.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2014-10-09 Thread Dev Lakhani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165293#comment-14165293
 ] 

Dev Lakhani commented on SPARK-3644:


Hi I am doing some work on the REST/JSON aspects and will be happy to take this 
on. Can someone assign it to me and/or help me get started?

We need to first draft out the various endpoints and document them somewhere.

Thanks
Dev

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Reporter: Josh Rosen

 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib

2014-10-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1486:
-
Assignee: (was: Burak Yavuz)

 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Critical

 It is rare in practice to train just one model with a given set of
 parameters. Usually, this is done by training multiple models with different
 sets of parameters and then selecting the best based on their performance on the
 validation set. MLlib should provide native support for multi-model
 training/scoring. It requires decoupling of concepts like problem, 
 formulation, algorithm, parameter set, and model, which are missing in MLlib 
 now. MLI implements similar concepts, which we can borrow. There are 
 different approaches for multi-model training:
 0) Keep one copy of the data, and train models one after another (or maybe in 
 parallel, depending on the scheduler).
 1) Keep one copy of the data, and train multiple models at the same time 
 (similar to `runs` in KMeans).
 2) Make multiple copies of the data (still stored distributively), and use 
 more cores to distribute the work.
 3) Collect the data, make the entire dataset available on workers, and train 
 one or more models on each worker.
 Users should be able to choose which execution mode they want to use. Note 
 that 3) could cover many use cases in practice when the training data is not 
 huge, e.g., 1GB.
 This task will be divided into sub-tasks and this JIRA is created to discuss 
 the design and track the overall progress.
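
A hedged sketch of approach 3) above: broadcast the (small) training set and fit one model per parameter setting on the workers; the train function stands in for any local solver:

{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch of approach 3): the whole training set is small enough to broadcast,
// so ship it once and fit one model per parameter setting in parallel.
def multiTrain[M: ClassTag](
    sc: SparkContext,
    data: Seq[LabeledPoint],
    params: Seq[Double],                        // e.g. regularization values
    train: (Seq[LabeledPoint], Double) => M): Seq[M] = {
  val bcData = sc.broadcast(data)
  sc.parallelize(params, numSlices = params.size)
    .map(p => train(bcData.value, p))
    .collect()
    .toSeq
}
{code}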



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib

2014-10-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1486:
-
Assignee: Burak Yavuz

 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Burak Yavuz
Priority: Critical

 It is rare in practice to train just one model with a given set of
 parameters. Usually, this is done by training multiple models with different
 sets of parameters and then selecting the best based on their performance on the
 validation set. MLlib should provide native support for multi-model
 training/scoring. It requires decoupling of concepts like problem, 
 formulation, algorithm, parameter set, and model, which are missing in MLlib 
 now. MLI implements similar concepts, which we can borrow. There are 
 different approaches for multi-model training:
 0) Keep one copy of the data, and train models one after another (or maybe in 
 parallel, depending on the scheduler).
 1) Keep one copy of the data, and train multiple models at the same time 
 (similar to `runs` in KMeans).
 2) Make multiple copies of the data (still stored distributively), and use 
 more cores to distribute the work.
 3) Collect the data, make the entire dataset available on workers, and train 
 one or more models on each worker.
 Users should be able to choose which execution mode they want to use. Note 
 that 3) could cover many use cases in practice when the training data is not 
 huge, e.g., 1GB.
 This task will be divided into sub-tasks and this JIRA is created to discuss 
 the design and track the overall progress.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3880) HBase as data source to SparkSQL

2014-10-09 Thread Yan (JIRA)
Yan created SPARK-3880:
--

 Summary: HBase as data source to SparkSQL
 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
Reporter: Yan
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2014-10-09 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
  Component/s: SQL
Fix Version/s: (was: 1.3.0)

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-10-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165359#comment-14165359
 ] 

Nicholas Chammas commented on SPARK-3376:
-

[~matei], [~rxin], [~pwendell]: This is something to have on your radars, I 
believe.

 Memory-based shuffle strategy to reduce overhead of disk I/O
 

 Key: SPARK-3376
 URL: https://issues.apache.org/jira/browse/SPARK-3376
 Project: Spark
  Issue Type: Planned Work
Reporter: uncleGen
Priority: Trivial

 I think a memory-based shuffle can reduce some of the overhead of disk I/O. I
 just want to know whether there is any plan to do something about it, or any
 suggestions. Based on the work in SPARK-2044, it is feasible to have several
 implementations of shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2014-10-09 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: HBaseOnSpark.docx

Design Document

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan
 Attachments: HBaseOnSpark.docx






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2014-10-09 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165364#comment-14165364
 ] 

Daniel Darabos commented on SPARK-3644:
---

Hi Dev, thanks for the offer! Have you seen Kousuke's PR? 
https://github.com/apache/spark/pull/2333 seems to cover a lot of ground. Maybe 
he or the reviewers there can tell you how to make yourself useful!

Unrelatedly, I wanted to mention that you can disregard my earlier comments. We 
cannot use XHR on these endpoints, since a different port means a different 
security domain. And anyway it turned out to be really easy to use a custom 
SparkListener for what we wanted to do. Sorry for the noise.

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Reporter: Josh Rosen

 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3881) This is a test JIRA

2014-10-09 Thread Spark User (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Spark User updated SPARK-3881:
--
Fix Version/s: (was: 1.2.0)
   1.1.1

 This is a test JIRA
 ---

 Key: SPARK-3881
 URL: https://issues.apache.org/jira/browse/SPARK-3881
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Spark User
 Fix For: 1.1.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3881) This is a test JIRA

2014-10-09 Thread Spark User (JIRA)
Spark User created SPARK-3881:
-

 Summary: This is a test JIRA
 Key: SPARK-3881
 URL: https://issues.apache.org/jira/browse/SPARK-3881
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Spark User
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3881) This is a test JIRA

2014-10-09 Thread Spark User (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Spark User updated SPARK-3881:
--
Fix Version/s: 1.3.0

 This is a test JIRA
 ---

 Key: SPARK-3881
 URL: https://issues.apache.org/jira/browse/SPARK-3881
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Spark User
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3881) This is a test JIRA

2014-10-09 Thread Spark User (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Spark User updated SPARK-3881:
--
Fix Version/s: (was: 1.1.1)

 This is a test JIRA
 ---

 Key: SPARK-3881
 URL: https://issues.apache.org/jira/browse/SPARK-3881
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Spark User
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3881) This is a test JIRA

2014-10-09 Thread Spark User (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Spark User updated SPARK-3881:
--
Fix Version/s: 1.3.0

 This is a test JIRA
 ---

 Key: SPARK-3881
 URL: https://issues.apache.org/jira/browse/SPARK-3881
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Spark User
Assignee: Spark User
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3881) This is a test JIRA

2014-10-09 Thread Spark User (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Spark User updated SPARK-3881:
--
Fix Version/s: 1.2.0

 This is a test JIRA
 ---

 Key: SPARK-3881
 URL: https://issues.apache.org/jira/browse/SPARK-3881
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Spark User
Assignee: Spark User
 Fix For: 1.2.0, 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3881) This is a test JIRA

2014-10-09 Thread Spark User (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Spark User resolved SPARK-3881.
---
Resolution: Invalid

 This is a test JIRA
 ---

 Key: SPARK-3881
 URL: https://issues.apache.org/jira/browse/SPARK-3881
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Reporter: Spark User
Assignee: Spark User
 Fix For: 1.2.0, 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-09 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165480#comment-14165480
 ] 

RJ Nowling commented on SPARK-2429:
---

Great work, Yu!

Ok, first off, let me make sure I understand what you're doing.  You start with 
2 centers.  You assign all the points.  You then apply KMeans recursively to 
each cluster, splitting each center into 2 centers.  Each instance of KMeans 
stops when the error is below a certain value or a fixed number of iterations 
have been run.

I think your analysis of the overall run time is good and probably what we 
expect.  Can you break down the timing to see which parts are the most 
expensive?  Maybe we can figure out where to optimize it.

A few thoughts on optimization:
1. It might be good to convert everything to Breeze vectors before you do any 
operations -- otherwise you end up converting the same vectors over and over 
again.  KMeans converts them at the beginning and converts the vectors for the 
centers back at the end.

2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, 
look into using a broadcast variable.  See the latest KMeans implementation. 
This could improve performance by 10%+.

3. You may want to look into using reduceByKey or similar RDD operations -- 
they will enable parallel reductions which will be faster than a loop on the 
master.

If you look at the JIRAs and PRs, there is some recent work to speed up KMeans 
-- maybe some of that is applicable?

I'll probably have more questions -- it's a good way of helping me understand 
what you're doing :)
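
A minimal sketch of suggestions (2) and (3), with placeholder names (lloydStep, 
points, findClosest) rather than the names in your code: broadcast the current 
centers once per step, and do the per-cluster reduction with reduceByKey instead 
of a loop on the driver.

{code}
// Sketch only -- placeholder names, not the actual patch.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}

def lloydStep(sc: SparkContext,
              points: RDD[BDV[Double]],
              centers: Array[BDV[Double]]): Array[BDV[Double]] = {
  // (2) Ship the current centers to the executors once per step instead of
  //     capturing them inside every closure / finder instance.
  val bcCenters = sc.broadcast(centers)

  def findClosest(p: BDV[Double]): Int =
    bcCenters.value.indices.minBy { i =>
      val d = bcCenters.value(i) - p
      d.dot(d)
    }

  // (3) Per-cluster sums and counts via reduceByKey, so the reduction runs in
  //     parallel on the executors rather than in a loop on the driver.
  val sums = points
    .map(p => (findClosest(p), (p, 1L)))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .collectAsMap()

  centers.indices.map { i =>
    sums.get(i)
        .map { case (s, c) => s * (1.0 / c) }   // new center = mean of its points
        .getOrElse(centers(i))                  // keep an empty cluster's old center
  }.toArray
}
{code}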

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
 Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean, 
 such as negative dot product or cosine, is necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3741) ConnectionManager.sendMessage may not propagate errors to MessageStatus

2014-10-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3741.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2593
[https://github.com/apache/spark/pull/2593]

 ConnectionManager.sendMessage may not propagate errors to MessageStatus
 ---

 Key: SPARK-3741
 URL: https://issues.apache.org/jira/browse/SPARK-3741
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Shixiong Zhu
Priority: Minor
 Fix For: 1.2.0


 If some network exception happens, ConnectionManager.sendMessage won't notify 
 MessageStatus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2014-10-09 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165500#comment-14165500
 ] 

Josh Rosen commented on SPARK-3644:
---

I think that a REST/JSON API is going to share many of the same design concerns 
as a stable Java/Scala-based progress reporting API for Spark.  It would be 
great to use consistent naming across both APIs.

SPARK-2321 deals with a Java API to expose programmatic access to many of the 
same things that a REST API would expose.  I have a pull request open for 
discussing the design of this API: https://github.com/apache/spark/pull/2696

It would be great if anyone here would comment on that PR / JIRA so that we can 
work out the basic issues of what to expose / how to expose it in the Java API. 
 Once we've figured this out, providing a REST wrapper should be fairly trivial.

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Reporter: Josh Rosen

 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger
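
Purely as a strawman for the discussion -- the path, fields, and values below are 
hypothetical, not an existing or proposed Spark endpoint -- a job-status call 
might look something like:

{code}
GET /api/v1/applications/app-20141009-0001/jobs/12

{
  "jobId": 12,
  "status": "RUNNING",
  "numTasks": 200,
  "numCompletedTasks": 150,
  "numFailedTasks": 0,
  "stageIds": [34, 35]
}
{code}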



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3854) Scala style: require spaces before `{`

2014-10-09 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165507#comment-14165507
 ] 

Josh Rosen commented on SPARK-3854:
---

Hey [~sarutak],

I don't really know anything about how Scalastyle works, so I'm going to defer 
to  [~prashant_] (ScrapCodes on GitHub), who implemented our current Scalastyle 
extensions: 
https://github.com/apache/spark/tree/master/project/spark-style/src/main/scala/org/apache/spark/scalastyle

 Scala style: require spaces before `{`
 --

 Key: SPARK-3854
 URL: https://issues.apache.org/jira/browse/SPARK-3854
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Josh Rosen

 We should require spaces before opening curly braces.  This isn't in the 
 style guide, but it probably should be:
 {code}
 // Correct:
 if (true) {
   println("Wow!")
 }
 // Incorrect:
 if (true){
   println("Wow!")
 }
 {code}
 See https://github.com/apache/spark/pull/1658#discussion-diff-18611791 for an 
 example in the wild.
 {{git grep "){"}} shows only a few occurrences of this style.
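
As a rough standalone illustration of what the check needs to flag (this is not 
the actual Scalastyle rule or its API, just a regex scan over source files):

{code}
import scala.io.Source

// Sketch only: flag any line where ')' is immediately followed by '{'.
object SpaceBeforeBraceCheck {
  private val violation = """\)\{""".r

  def main(args: Array[String]): Unit = {
    for {
      path <- args
      (line, idx) <- Source.fromFile(path).getLines().zipWithIndex
      if violation.findFirstIn(line).isDefined
    } println(s"$path:${idx + 1}: missing space before '{'")
  }
}
{code}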



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-09 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165513#comment-14165513
 ] 

Matt Cheah commented on SPARK-3736:
---

Are the two linked cases above different though?

(1) If the worker itself gets locked up, the master sends a heartbeat but the 
worker doesn't respond, and the master drops the connection with the worker. 
However the master doesn't send a message to the worker indicating this 
disconnection, so the worker can't know to reconnect. To repro this I set a 
breakpoint in the Worker's heartbeat reception code and let the worker time 
out, and after the worker times out it never receives a DisassociatedEvent, 
nor is Worker.masterDisconnected() ever called.

(2) If the master crashes, the Worker receives a DisassociatedEvent and sits 
idly. We can fix this by actively attempting to reconnect.

Clearly we can address the second case with the Worker actively trying to 
reconnect itself. But how can we address the first case?
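
For the second case, a minimal sketch of the retry-until-reconnected behaviour 
described in the issue, using plain JDK scheduling and made-up names rather than 
the Worker's actual Akka-based code:

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical sketch for case (2): when the Worker sees the master go away,
// retry registration at a fixed interval until it succeeds (as Hadoop does).
class ReconnectingWorker(registerWithMaster: () => Boolean,
                         intervalSeconds: Long = 10L) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Call this from the disassociation handler.
  def onMasterDisconnected(): Unit = {
    val attempt = new Runnable {
      def run(): Unit = {
        // Stop retrying once a registration attempt succeeds.
        if (registerWithMaster()) scheduler.shutdown()
      }
    }
    scheduler.scheduleAtFixedRate(attempt, 0L, intervalSeconds, TimeUnit.SECONDS)
  }
}
{code}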

 Workers should reconnect to Master if disconnected
 --

 Key: SPARK-3736
 URL: https://issues.apache.org/jira/browse/SPARK-3736
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Andrew Ash
Assignee: Matthew Cheah
Priority: Critical

 In standalone mode, when a worker gets disconnected from the master for some 
 reason it never attempts to reconnect.  In this situation you have to bounce 
 the worker before it will reconnect to the master.
 The preferred alternative is to follow what Hadoop does -- when there's a 
 disconnect, attempt to reconnect at a particular interval until successful (I 
 think it repeats indefinitely every 10sec).
 This has been observed by:
 - [~pkolaczk] in 
 http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
 - [~romi-totango] in 
 http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
 - [~aash]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-09 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165513#comment-14165513
 ] 

Matt Cheah edited comment on SPARK-3736 at 10/9/14 6:42 PM:


Are the two linked cases above different though?

(1) If the worker itself gets locked up, the master sends a heartbeat but the 
worker doesn't respond, and the master drops the connection with the worker. 
However the master doesn't send a message to the worker indicating this 
disconnection, so the worker can't know to reconnect. To repro this I set a 
breakpoint in the Worker's heartbeat reception code and let the worker time 
out, and after the worker times out it never receives a DisassociatedEvent, 
nor is Worker.masterDisconnected() ever called.

(2) If the master crashes, the Worker receives a DisassociatedEvent and sits 
idly. We can fix this by making the Worker actively attempt to reconnect.

Clearly we can address the second case with the Worker actively trying to 
reconnect itself. But how can we address the first case?


was (Author: mcheah):
Are the two linked cases above different though?

(1) If the worker itself gets locked up, the master sends a heartbeat but the 
worker doesn't respond, and the master drops the connection with the worker. 
However the master doesn't send a message to the worker indicating this 
disconnection, so the worker can't know to reconnect. To repro this I set a 
breakpoint in the Worker's heartbeat reception code and let the worker time 
out, and after the worker times out it never receives a DisassociatedEvent, 
nor is Worker.masterDisconnected() ever called.

(2) If the master crashes, the Worker receives a DisassociatedEvent and sits 
idly. We can fix this by actively attempting to reconnect.

Clearly we can address the second case with the Worker actively trying to 
reconnect itself. But how can we address the first case?

 Workers should reconnect to Master if disconnected
 --

 Key: SPARK-3736
 URL: https://issues.apache.org/jira/browse/SPARK-3736
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Andrew Ash
Assignee: Matthew Cheah
Priority: Critical

 In standalone mode, when a worker gets disconnected from the master for some 
 reason it never attempts to reconnect.  In this situation you have to bounce 
 the worker before it will reconnect to the master.
 The preferred alternative is to follow what Hadoop does -- when there's a 
 disconnect, attempt to reconnect at a particular interval until successful (I 
 think it repeats indefinitely every 10sec).
 This has been observed by:
 - [~pkolaczk] in 
 http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
 - [~romi-totango] in 
 http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
 - [~aash]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job

2014-10-09 Thread Davis Shepherd (JIRA)
Davis Shepherd created SPARK-3882:
-

 Summary: JobProgressListener gets permanently out of sync with 
long running job
 Key: SPARK-3882
 URL: https://issues.apache.org/jira/browse/SPARK-3882
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd


A long running spark context (non-streaming) will eventually start throwing the 
following in the driver:

java.util.NoSuchElementException: key not found: 12771
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:58)
  at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
  at 
org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at 
org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
  at 
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
  at scala.Option.foreach(Option.scala:236)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
2014-10-09 18:45:33,523 [SparkListenerBus] ERROR 
org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener threw 
an exception
java.util.NoSuchElementException: key not found: 12782
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:58)
  at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
  at 
org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
  at 
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at 
org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
  at 
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
  at 
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
  at scala.Option.foreach(Option.scala:236)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
  at 
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)

And the UI will show running jobs that are in fact no longer running, and it 
never cleans them up (see attached screenshot).

The result is that the UI becomes unusable, and the JobProgressListener leaks 

[jira] [Updated] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job

2014-10-09 Thread Davis Shepherd (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Shepherd updated SPARK-3882:
--
Attachment: Screen Shot 2014-10-03 at 12.50.59 PM.png

Lots of orphaned jobs.

 JobProgressListener gets permanently out of sync with long running job
 --

 Key: SPARK-3882
 URL: https://issues.apache.org/jira/browse/SPARK-3882
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd
 Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png


 A long running spark context (non-streaming) will eventually start throwing 
 the following in the driver:
 java.util.NoSuchElementException: key not found: 12771
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR 
 org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener 
 threw an exception
 java.util.NoSuchElementException: key not found: 12782
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
   at 
 

[jira] [Comment Edited] (SPARK-3882) JobProgressListener gets permanently out of sync with long running job

2014-10-09 Thread Davis Shepherd (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165535#comment-14165535
 ] 

Davis Shepherd edited comment on SPARK-3882 at 10/9/14 6:51 PM:


Attached web ui screenshot.


was (Author: dgshep):
Lots of orphaned jobs.

 JobProgressListener gets permanently out of sync with long running job
 --

 Key: SPARK-3882
 URL: https://issues.apache.org/jira/browse/SPARK-3882
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.2
Reporter: Davis Shepherd
 Attachments: Screen Shot 2014-10-03 at 12.50.59 PM.png


 A long running spark context (non-streaming) will eventually start throwing 
 the following in the driver:
 java.util.NoSuchElementException: key not found: 12771
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
 2014-10-09 18:45:33,523 [SparkListenerBus] ERROR 
 org.apache.spark.scheduler.LiveListenerBus - Listener JobProgressListener 
 threw an exception
 java.util.NoSuchElementException: key not found: 12782
   at scala.collection.MapLike$class.default(MapLike.scala:228)
   at scala.collection.AbstractMap.default(Map.scala:58)
   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
   at 
 org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:79)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
   at 
 org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
   at 
 org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
   at 
 org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
   at 
 

[jira] [Created] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections

2014-10-09 Thread Jacek Lewandowski (JIRA)
Jacek Lewandowski created SPARK-3883:


 Summary: Provide SSL support for Akka and HttpServer based 
connections
 Key: SPARK-3883
 URL: https://issues.apache.org/jira/browse/SPARK-3883
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Reporter: Jacek Lewandowski


Spark uses at least 4 logical communication channels:
1. Control messages - Akka based
2. JARs and other files - Jetty based (HttpServer)
3. Computation results - Java NIO based
4. Web UI - Jetty based

The aim of this feature is to enable SSL for (1) and (2).

Why:
Spark configuration is sent through (1). Spark configuration may contain 
sensitive information like credentials for accessing external data sources or 
streams. Application JAR files (2) may include the application logic and 
therefore they may include information about the structure of the external data 
sources, and credentials as well. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3711) Optimize where in clause filter queries

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3711.
-
   Resolution: Fixed
Fix Version/s: (was: 1.1.1)
   1.2.0

Issue resolved by pull request 2561
[https://github.com/apache/spark/pull/2561]

 Optimize where in clause filter queries
 ---

 Key: SPARK-3711
 URL: https://issues.apache.org/jira/browse/SPARK-3711
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yash Datta
Priority: Minor
 Fix For: 1.2.0


 The In case class is replaced by an InSet class when all the filter values are 
 literals; InSet uses a HashSet instead of a Sequence, thereby giving a 
 significant performance improvement. The maximum improvement should be visible 
 when only a small percentage of a large dataset matches the filter list.
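
Schematically (plain Scala, not the actual Catalyst In/InSet expressions), the 
difference is the per-row cost of the membership test:

{code}
// Schematic only: with a Seq the membership test is a linear scan for every
// row; with a HashSet it is a (near) constant-time lookup.
val filterValues: Seq[Int] = (1 to 100000).toSeq   // the literals in the IN (...) list
val filterSet: Set[Int]    = filterValues.toSet    // built once up front by InSet

def inViaSeq(value: Int): Boolean = filterValues.contains(value) // O(n) per row
def inViaSet(value: Int): Boolean = filterSet.contains(value)    // ~O(1) per row
{code}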



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3806) minor bug in CliSuite

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3806.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2666
[https://github.com/apache/spark/pull/2666]

 minor bug in CliSuite
 -

 Key: SPARK-3806
 URL: https://issues.apache.org/jira/browse/SPARK-3806
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


 CliSuite throws an exception as follows:
 Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
   at 
 scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
   at 
 org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
   at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
   at 
 scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
   at 
 scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
   at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
   at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
   at 
 scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
   at 
 scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
   at 
 scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
   at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2014-10-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165702#comment-14165702
 ] 

Reynold Xin commented on SPARK-3376:


It is definitely possible. We should evaluate the benefit. What I've found 
recently is that with SSDs and zero-copy sends, disk-based shuffle can be pretty 
fast as well. That is, the network (assuming 10G) is the new bottleneck. 

 Memory-based shuffle strategy to reduce overhead of disk I/O
 

 Key: SPARK-3376
 URL: https://issues.apache.org/jira/browse/SPARK-3376
 Project: Spark
  Issue Type: Planned Work
Reporter: uncleGen
Priority: Trivial

 I think a memory-based shuffle can reduce some of the overhead of disk I/O. I 
 just want to know whether there is any plan to do something about it, or any 
 suggestions about it. Based on the work in SPARK-2044, it is feasible to have 
 several implementations of shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3868) Hard to recognize which module is tested from unit-tests.log

2014-10-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-3868:
-

Assignee: Josh Rosen

 Hard to recognize which module is tested from unit-tests.log
 

 Key: SPARK-3868
 URL: https://issues.apache.org/jira/browse/SPARK-3868
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20
Reporter: cocoatomo
Assignee: Josh Rosen
  Labels: pyspark, testing
 Fix For: 1.2.0


 The ./python/run-tests script displays messages about which test it is 
 currently running on stdout, but does not write them to unit-tests.log.
 This makes it harder to recognize which test programs were executed and which 
 test failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI

2014-10-09 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165722#comment-14165722
 ] 

Matt Cheah commented on SPARK-3835:
---

Any updates on this?

 Spark applications that are killed should show up as KILLED or CANCELLED 
 in the Spark UI
 

 Key: SPARK-3835
 URL: https://issues.apache.org/jira/browse/SPARK-3835
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Matt Cheah
  Labels: UI

 Spark applications that crash or are killed are listed as FINISHED in the 
 Spark UI.
 It looks like the Master only passes back a list of Running applications 
 and a list of Completed applications. All of the applications under 
 Completed have status FINISHED; however, if they were killed manually they 
 should show CANCELLED, or if they failed they should read FAILED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3868) Hard to recognize which module is tested from unit-tests.log

2014-10-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3868:
--
Assignee: cocoatomo  (was: Josh Rosen)

 Hard to recognize which module is tested from unit-tests.log
 

 Key: SPARK-3868
 URL: https://issues.apache.org/jira/browse/SPARK-3868
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20
Reporter: cocoatomo
Assignee: cocoatomo
  Labels: pyspark, testing
 Fix For: 1.2.0


 The ./python/run-tests script displays messages about which test it is 
 currently running on stdout, but does not write them to unit-tests.log.
 This makes it harder to recognize which test programs were executed and which 
 test failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI

2014-10-09 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165722#comment-14165722
 ] 

Matt Cheah edited comment on SPARK-3835 at 10/9/14 8:48 PM:


Any updates on this? I've tried tackling it myself but I'm actually not sure 
how possible this is - killing a JVM just causes a DisassociatedEvent to be 
fired... but a DisassociatedEvent is also fired if SparkContext.stop() is 
called, making it hard to tell if a context was stopped gracefully or 
forcefully.


was (Author: mcheah):
Any updates on this?

 Spark applications that are killed should show up as KILLED or CANCELLED 
 in the Spark UI
 

 Key: SPARK-3835
 URL: https://issues.apache.org/jira/browse/SPARK-3835
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Matt Cheah
  Labels: UI

 Spark applications that crash or are killed are listed as FINISHED in the 
 Spark UI.
 It looks like the Master only passes back a list of Running applications 
 and a list of Completed applications. All of the applications under 
 Completed have status FINISHED; however, if they were killed manually they 
 should show CANCELLED, or if they failed they should read FAILED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3853) JsonRDD does not support converting fields to type Timestamp

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3853.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2720
[https://github.com/apache/spark/pull/2720]

 JsonRDD does not support converting fields to type Timestamp
 

 Key: SPARK-3853
 URL: https://issues.apache.org/jira/browse/SPARK-3853
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Michael Timper
 Fix For: 1.2.0


 create a SchemaRDD using 
 eventsSchema = sqlContext.jsonRDD(jsonEventsRdd, schemaWithTimestampField)
 eventsSchema.registerTempTable("events")
 sqlContext.sql("select max(time_field) from events")
 Throws this exception:
 scala.MatchError: TimestampType (of class 
 org.apache.spark.sql.catalyst.types.TimestampType$)
 
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:357)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:391)
 ..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3884) Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is cluster

2014-10-09 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3884:
-

 Summary: Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is 
cluster
 Key: SPARK-3884
 URL: https://issues.apache.org/jira/browse/SPARK-3884
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM

2014-10-09 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3884:
--
Summary: If deploy mode is cluster, --driver-memory shouldn't apply to 
client JVM  (was: Don't set SPARK_SUBMIT_DRIVER_MEMORY if deploy mode is 
cluster)

 If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
 

 Key: SPARK-3884
 URL: https://issues.apache.org/jira/browse/SPARK-3884
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI

2014-10-09 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165768#comment-14165768
 ] 

Nan Zhu commented on SPARK-3835:


this problem still exists? I once reported the same thing in SPARK-1118



 Spark applications that are killed should show up as KILLED or CANCELLED 
 in the Spark UI
 

 Key: SPARK-3835
 URL: https://issues.apache.org/jira/browse/SPARK-3835
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Matt Cheah
  Labels: UI

 Spark applications that crash or are killed are listed as FINISHED in the 
 Spark UI.
 It looks like the Master only passes back a list of Running applications 
 and a list of Completed applications. All of the applications under 
 Completed have status FINISHED; however, if they were killed manually they 
 should show CANCELLED, or if they failed they should read FAILED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3885) Provide mechanism to remove accumulators once they are no longer used

2014-10-09 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3885:
-

 Summary: Provide mechanism to remove accumulators once they are no 
longer used
 Key: SPARK-3885
 URL: https://issues.apache.org/jira/browse/SPARK-3885
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.0.2, 1.2.0
Reporter: Josh Rosen


Spark does not currently provide any mechanism to delete accumulators after 
they are no longer used.  This can lead to OOMs for long-lived SparkContexts 
that create many large accumulators.

Part of the problem is that accumulators are registered in a global 
{{Accumulators}} registry.  Maybe the fix would be as simple as using weak 
references in the Accumulators registry so that accumulators can be GC'd once 
they can no longer be used.

In the meantime, here's a workaround that users can try:

Accumulators have a public setValue() method that can be called (only by the 
driver) to change an accumulator’s value.  You might be able to use this to 
reset accumulators’ values to smaller objects (e.g. the “zero” object of 
whatever your accumulator type is, or ‘null’ if you’re sure that the 
accumulator will never be accessed again).

This issue was originally reported by [~nkronenfeld] on the dev mailing list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-Accumulator-question-td8709.html
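
A minimal sketch of that workaround, assuming the Spark 1.x accumulator API; the 
names and paths below are made up for the example, and setValue() must be called 
from the driver:

{code}
import org.apache.spark.AccumulatorParam

// Illustrative only; `sc` is an existing SparkContext. The accumulator's value
// (a set of keys) can grow large, so we reset it once we are done with it.
implicit object StringSetParam extends AccumulatorParam[Set[String]] {
  def zero(initial: Set[String]): Set[String] = Set.empty
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}

val seenKeys = sc.accumulator(Set.empty[String])
sc.textFile("hdfs:///path/to/input").foreach { line =>
  seenKeys += Set(line.split(",")(0))   // record the first field of each line
}

val keys = seenKeys.value               // read the result on the driver
// Done with the value: reset it to the "zero" object so the large Set is no
// longer retained through the global Accumulators registry.
seenKeys.setValue(Set.empty[String])
{code}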



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3884) If deploy mode is cluster, --driver-memory shouldn't apply to client JVM

2014-10-09 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165776#comment-14165776
 ] 

Sandy Ryza commented on SPARK-3884:
---

Accidentally assigned this to myself, but others should feel free to pick it up

 If deploy mode is cluster, --driver-memory shouldn't apply to client JVM
 

 Key: SPARK-3884
 URL: https://issues.apache.org/jira/browse/SPARK-3884
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI

2014-10-09 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165785#comment-14165785
 ] 

Matt Cheah commented on SPARK-3835:
---

This is the opposite problem, actually - a Spark context that is killed 
forcefully, i.e. kill -9 on the JVM hosting the context, is shown as FINISHED 
but should be shown as KILLED.

 Spark applications that are killed should show up as KILLED or CANCELLED 
 in the Spark UI
 

 Key: SPARK-3835
 URL: https://issues.apache.org/jira/browse/SPARK-3835
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Matt Cheah
  Labels: UI

 Spark applications that crash or are killed are listed as FINISHED in the 
 Spark UI.
 It looks like the Master only passes back a list of Running applications 
 and a list of Completed applications. All of the applications under 
 Completed have status FINISHED; however, if they were killed manually they 
 should show CANCELLED, or if they failed they should read FAILED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3814) Bitwise & does not work in Hive

2014-10-09 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165790#comment-14165790
 ] 

Ravindra Pesala commented on SPARK-3814:


https://github.com/apache/spark/pull/2736

 Bitwise & does not work in Hive
 

 Key: SPARK-3814
 URL: https://issues.apache.org/jira/browse/SPARK-3814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yana Kadiyska
Priority: Minor

 Error: java.lang.RuntimeException: 
 Unsupported language features in query: select (case when bit_field & 1=1 
 then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' 
 LIMIT 2
 TOK_QUERY
   TOK_FROM
 TOK_TABREF
   TOK_TABNAME
mytable 
   TOK_INSERT
 TOK_DESTINATION
   TOK_DIR
 TOK_TMP_FILE
 TOK_SELECT
   TOK_SELEXPR
 TOK_FUNCTION
   when
   =
 
   TOK_TABLE_OR_COL
 bit_field
   1
 1
   -
 TOK_TABLE_OR_COL
   r_end
 TOK_TABLE_OR_COL
   r_start
   TOK_NULL
 TOK_WHERE
   =
 TOK_TABLE_OR_COL
   pkey
 '0178-2014-07'
 TOK_LIMIT
   2
 SQLState:  null
 ErrorCode: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3834) Backticks not correctly handled in subquery aliases

2014-10-09 Thread Ravindra Pesala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165800#comment-14165800
 ] 

Ravindra Pesala commented on SPARK-3834:


https://github.com/apache/spark/pull/2737

 Backticks not correctly handled in subquery aliases
 ---

 Key: SPARK-3834
 URL: https://issues.apache.org/jira/browse/SPARK-3834
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Ravindra Pesala
Priority: Blocker

 [~ravi.pesala]  assigning to you since you fixed the last problem here.  Let 
 me know if you don't have time to work on this or if you have any questions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI

2014-10-09 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165807#comment-14165807
 ] 

Nan Zhu commented on SPARK-3835:


ah, I see, didn't look at your description closely


Does shutdown hook work?



 Spark applications that are killed should show up as KILLED or CANCELLED 
 in the Spark UI
 

 Key: SPARK-3835
 URL: https://issues.apache.org/jira/browse/SPARK-3835
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Matt Cheah
  Labels: UI

 Spark applications that crash or are killed are listed as FINISHED in the 
 Spark UI.
 It looks like the Master only passes back a list of Running applications 
 and a list of Completed applications. All of the applications under 
 Completed have status FINISHED; however, if they were killed manually they 
 should show CANCELLED, or if they failed they should read FAILED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3835) Spark applications that are killed should show up as KILLED or CANCELLED in the Spark UI

2014-10-09 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165815#comment-14165815
 ] 

Nan Zhu commented on SPARK-3835:


no...it cannot capture kill -9

 Spark applications that are killed should show up as KILLED or CANCELLED 
 in the Spark UI
 

 Key: SPARK-3835
 URL: https://issues.apache.org/jira/browse/SPARK-3835
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Matt Cheah
  Labels: UI

 Spark applications that crash or are killed are listed as FINISHED in the 
 Spark UI.
 It looks like the Master only passes back a list of Running applications 
 and a list of Completed applications. All of the applications under 
 Completed have status FINISHED; however, if they were killed manually they 
 should show CANCELLED, or if they failed they should read FAILED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3339) Support for skipping json lines that fail to parse

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3339.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2680
[https://github.com/apache/spark/pull/2680]

 Support for skipping json lines that fail to parse
 --

 Key: SPARK-3339
 URL: https://issues.apache.org/jira/browse/SPARK-3339
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Critical
 Fix For: 1.2.0


 When dealing with large datasets there is always some data that fails to 
 parse.  It would be nice to handle this instead of throwing an exception and 
 requiring the user to filter it out manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3796) Create shuffle service for external block storage

2014-10-09 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-3796:
--
Description: 
This task will be broken up into two parts -- the first is to refactor our 
internal shuffle service to use a BlockTransferService which we can easily 
extract out into its own service, and the second is to actually do the 
extraction.

Here is the design document for the low-level service, nicknamed Sluice, on 
top of which will be Spark's BlockTransferService API:
https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0

 Create shuffle service for external block storage
 -

 Key: SPARK-3796
 URL: https://issues.apache.org/jira/browse/SPARK-3796
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Aaron Davidson

 This task will be broken up into two parts -- the first is to refactor 
 our internal shuffle service to use a BlockTransferService which we can 
 easily extract out into its own service, and the second is to actually 
 do the extraction.
 Here is the design document for the low-level service, nicknamed Sluice, on 
 top of which will be Spark's BlockTransferService API:
 https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3412) Add Missing Types for Row API

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3412.
-
Resolution: Fixed

Issue resolved by pull request 2529
[https://github.com/apache/spark/pull/2529]

 Add Missing Types for Row API
 -

 Key: SPARK-3412
 URL: https://issues.apache.org/jira/browse/SPARK-3412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3858) SchemaRDD.generate ignores alias argument

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3858.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2721
[https://github.com/apache/spark/pull/2721]

 SchemaRDD.generate ignores alias argument
 -

 Key: SPARK-3858
 URL: https://issues.apache.org/jira/browse/SPARK-3858
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Nathan Howell
Priority: Minor
 Fix For: 1.2.0


 The {{alias}} argument to {{SchemaRDD.generate}} is discarded and a constant 
 {{None}} is supplied to the {{logical.Generate}} constructor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3813) Support case when conditional functions in Spark SQL

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3813.
-
Resolution: Fixed

Issue resolved by pull request 2678
[https://github.com/apache/spark/pull/2678]

 Support case when conditional functions in Spark SQL
 --

 Key: SPARK-3813
 URL: https://issues.apache.org/jira/browse/SPARK-3813
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Ravindra Pesala
 Fix For: 1.2.0


 SQL queries which use the following conditional functions are not supported 
 in Spark SQL.
 {code}
 CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
 CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
 {code}
 The same functions can work in Spark HiveQL.
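
As a concrete illustration of the unsupported shape (assuming a placeholder 
{{sqlContext}} and made-up table and column names), a query like the following 
parsed through Spark HiveQL but not through the plain SQL dialect at the time 
of this ticket:

{code}
// Illustrative only: table/column names are made up.
sqlContext.sql(
  """SELECT name,
    |       CASE WHEN score >= 60 THEN 'pass' ELSE 'fail' END AS result
    |FROM students""".stripMargin)
{code}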



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3873) Scala style: check import ordering

2014-10-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165916#comment-14165916
 ] 

Marcelo Vanzin commented on SPARK-3873:
---

Actually looking at this, since I've been playing with the scalariform API in 
other places anyway...

 Scala style: check import ordering
 --

 Key: SPARK-3873
 URL: https://issues.apache.org/jira/browse/SPARK-3873
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Marcelo Vanzin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3886) Choose the batch size of serializer based on size of object

2014-10-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3886:
-

 Summary: Choose the batch size of serializer based on size of 
object
 Key: SPARK-3886
 URL: https://issues.apache.org/jira/browse/SPARK-3886
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


The default batch size (1024) may not work for huge objects, so it would be 
better to choose an appropriate batch size based on the size of the objects.
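
A minimal sketch of the idea, assuming a hypothetical helper that is given 
sampled serialized object sizes; nothing here is PySpark's actual serializer 
code:

{code}
// Hypothetical heuristic: shrink the batch when serialized objects are large so a
// single batch stays under a target byte budget.
def chooseBatchSize(sampledObjectSizes: Seq[Long],
                    targetBatchBytes: Long = 64L * 1024 * 1024,
                    defaultBatch: Int = 1024): Int = {
  if (sampledObjectSizes.isEmpty) defaultBatch
  else {
    val avg = math.max(1L, sampledObjectSizes.sum / sampledObjectSizes.length)
    math.max(1, math.min(defaultBatch.toLong, targetBatchBytes / avg).toInt)
  }
}
{code}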



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3772) RDD operation on IPython REPL failed with an illegal port number

2014-10-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3772.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2651
[https://github.com/apache/spark/pull/2651]

 RDD operation on IPython REPL failed with an illegal port number
 

 Key: SPARK-3772
 URL: https://issues.apache.org/jira/browse/SPARK-3772
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
 Environment: Mac OS X 10.9.5, Python 2.7.8, IPython 2.2.0
Reporter: cocoatomo
  Labels: pyspark
 Fix For: 1.2.0


 To reproduce this issue, we should execute the following commands on the commit: 
 6e27cb630de69fa5acb510b4e2f6b980742b1957.
 {quote}
 $ PYSPARK_PYTHON=ipython ./bin/pyspark
 ...
 In [1]: file = sc.textFile('README.md')
 In [2]: file.first()
 ...
 14/10/03 08:50:13 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/10/03 08:50:13 WARN LoadSnappy: Snappy native library not loaded
 14/10/03 08:50:13 INFO FileInputFormat: Total input paths to process : 1
 14/10/03 08:50:13 INFO SparkContext: Starting job: runJob at 
 PythonRDD.scala:334
 14/10/03 08:50:13 INFO DAGScheduler: Got job 0 (runJob at 
 PythonRDD.scala:334) with 1 output partitions (allowLocal=true)
 14/10/03 08:50:13 INFO DAGScheduler: Final stage: Stage 0(runJob at 
 PythonRDD.scala:334)
 14/10/03 08:50:13 INFO DAGScheduler: Parents of final stage: List()
 14/10/03 08:50:13 INFO DAGScheduler: Missing parents: List()
 14/10/03 08:50:13 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at RDD 
 at PythonRDD.scala:44), which has no missing parents
 14/10/03 08:50:13 INFO MemoryStore: ensureFreeSpace(4456) called with 
 curMem=57388, maxMem=278019440
 14/10/03 08:50:13 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory (estimated size 4.4 KB, free 265.1 MB)
 14/10/03 08:50:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[2] at RDD at PythonRDD.scala:44)
 14/10/03 08:50:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/10/03 08:50:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
 localhost, PROCESS_LOCAL, 1207 bytes)
 14/10/03 08:50:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
 14/10/03 08:50:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
 java.lang.IllegalArgumentException: port out of range:1027423549
   at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
   at java.net.InetSocketAddress.<init>(InetSocketAddress.java:188)
   at java.net.Socket.<init>(Socket.java:244)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:75)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:90)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
   at 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:100)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:71)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:744)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections

2014-10-09 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165986#comment-14165986
 ] 

Jacek Lewandowski commented on SPARK-3883:
--

https://github.com/apache/spark/pull/2739

 Provide SSL support for Akka and HttpServer based connections
 -

 Key: SPARK-3883
 URL: https://issues.apache.org/jira/browse/SPARK-3883
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jacek Lewandowski

 Spark uses at least 4 logical communication channels:
 1. Control messages - Akka based
 2. JARs and other files - Jetty based (HttpServer)
 3. Computation results - Java NIO based
 4. Web UI - Jetty based
 The aim of this feature is to enable SSL for (1) and (2).
 Why:
 Spark configuration is sent through (1). Spark configuration may contain 
 sensitive information like credentials for accessing external data sources or 
 streams. Application JAR files (2) may include the application logic and 
 therefore they may include information about the structure of the external 
 data sources, and credentials as well. 
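
Purely as an illustration of what per-channel SSL settings could look like to a 
user, a hypothetical configuration sketch is below; the key names are 
illustrative assumptions, not the configuration introduced by the linked pull 
request:

{code}
import org.apache.spark.SparkConf

// Hypothetical key names, shown only to suggest the kind of configuration this
// feature implies for the Akka (control) and HttpServer (file) channels.
val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore.jks")
  .set("spark.ssl.keyStorePassword", "********")
  .set("spark.ssl.trustStore", "/path/to/truststore.jks")
{code}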



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3887) ConnectionManager should log remote exception when reporting remote errors

2014-10-09 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3887:
-

 Summary: ConnectionManager should log remote exception when 
reporting remote errors
 Key: SPARK-3887
 URL: https://issues.apache.org/jira/browse/SPARK-3887
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen


When reporting that a remote error occurred, the ConnectionManager should also 
log the stacktrace of the remote exception.  This can be accomplished by 
sending the remote exception's stacktrace as the payload in the negative ACK / 
error message that's sent by the error-handling code.
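
A small sketch of the mechanism described above, using plain JDK calls; the 
helper name is made up:

{code}
import java.io.{PrintWriter, StringWriter}
import java.nio.charset.StandardCharsets

// Capture a throwable's stack trace as bytes so it can travel as the payload of
// an error / negative-ACK message and be logged on the receiving side.
def stackTraceBytes(e: Throwable): Array[Byte] = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw, true))
  sw.toString.getBytes(StandardCharsets.UTF_8)
}
{code}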



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3888) Limit the memory used by python worker

2014-10-09 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3888:
-

 Summary: Limit the memory used by python worker
 Key: SPARK-3888
 URL: https://issues.apache.org/jira/browse/SPARK-3888
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu


Right now we do not limit the memory used by Python workers, so they may run out 
of memory and freeze the OS. It would be safer to have a configurable hard limit, 
which should be larger than spark.executor.python.memory.
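
A sketch of how such a limit could surface to users, assuming a hypothetical 
configuration key for the hard limit (only spark.executor.python.memory is named 
in this ticket):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.python.memory", "512m")  // soft setting named in this ticket
  .set("spark.python.worker.hardLimit", "1g")   // hypothetical hard limit; key name is illustrative
{code}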



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK

2014-10-09 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3889:
-

 Summary: JVM dies with SIGBUS, resulting in ConnectionManager 
failed ACK
 Key: SPARK-3889
 URL: https://issues.apache.org/jira/browse/SPARK-3889
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Critical


Here's the first part of the core dump:

{code}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
#
# JRE version: 7.0_25-b30
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed 
oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try ulimit -c unlimited before starting Java again
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
#

---  T H R E A D  ---

Current thread (0x7fa4b0631000):  JavaThread Executor task launch 
worker-170 daemon [_thread_in_Java, id=6783, 
stack(0x7fa4448ef000,0x7fa4449f)]

siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
si_addr=0x7fa428f79000
{code}

Here is the only useful content I can find related to JVM and SIGBUS from 
Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664

It appears it may be related to disposing byte buffers, which we do in the 
ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
them in BufferMessage.
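
An illustrative (and deliberately unsafe) sketch of this failure mode, not 
Spark's code: unmapping a memory-mapped buffer while some reader could still 
touch it can crash the whole JVM with SIGBUS instead of throwing a Java-level 
exception. The file name below is a placeholder.

{code}
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

// Map a (placeholder) shuffle file, then free the mapping eagerly. Any later
// access to `buf` reads unmapped memory; on Linux that typically kills the JVM
// with SIGBUS rather than raising an exception.
val channel = new RandomAccessFile("shuffle_0_0_0.data", "r").getChannel
val buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
channel.close()
buf.asInstanceOf[sun.nio.ch.DirectBuffer].cleaner().clean() // eager unmap ("dispose")
// buf.get(0)  // would now touch freed memory
{code}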



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK

2014-10-09 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-3889:
--
Description: 
Here's the first part of the core dump, possibly caused by a job which shuffles 
a lot of very small partitions.

{code}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
#
# JRE version: 7.0_25-b30
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed 
oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try ulimit -c unlimited before starting Java again
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
#

---  T H R E A D  ---

Current thread (0x7fa4b0631000):  JavaThread Executor task launch 
worker-170 daemon [_thread_in_Java, id=6783, 
stack(0x7fa4448ef000,0x7fa4449f)]

siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
si_addr=0x7fa428f79000
{code}

Here is the only useful content I can find related to JVM and SIGBUS from 
Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664

It appears it may be related to disposing byte buffers, which we do in the 
ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
them in BufferMessage.

  was:
Here's the first part of the core dump:

{code}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
#
# JRE version: 7.0_25-b30
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed 
oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try ulimit -c unlimited before starting Java again
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
#

---  T H R E A D  ---

Current thread (0x7fa4b0631000):  JavaThread Executor task launch 
worker-170 daemon [_thread_in_Java, id=6783, 
stack(0x7fa4448ef000,0x7fa4449f)]

siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
si_addr=0x7fa428f79000
{code}

Here is the only useful content I can find related to JVM and SIGBUS from 
Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664

It appears it may be related to disposing byte buffers, which we do in the 
ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
them in BufferMessage.


 JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
 ---

 Key: SPARK-3889
 URL: https://issues.apache.org/jira/browse/SPARK-3889
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Critical

 Here's the first part of the core dump, possibly caused by a job which 
 shuffles a lot of very small partitions.
 {code}
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
 #
 # JRE version: 7.0_25-b30
 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 
 compressed oops)
 # Problematic frame:
 # v  ~StubRoutines::jbyte_disjoint_arraycopy
 #
 # Failed to write core dump. Core dumps have been disabled. To enable core 
 dumping, try ulimit -c unlimited before starting Java again
 #
 # If you would like to submit a bug report, please include
 # instructions on how to reproduce the bug and visit:
 #   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
 #
 ---  T H R E A D  ---
 Current thread (0x7fa4b0631000):  JavaThread Executor task launch 
 worker-170 daemon [_thread_in_Java, id=6783, 
 stack(0x7fa4448ef000,0x7fa4449f)]
 siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
 si_addr=0x7fa428f79000
 {code}
 Here is the only useful content I can find related to JVM and SIGBUS from 
 Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664
 It appears it may be related to disposing byte buffers, which we do in the 
 ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
 them in BufferMessage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-3798) Corrupted projection in Generator

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3798.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2656
[https://github.com/apache/spark/pull/2656]

 Corrupted projection in Generator
 -

 Key: SPARK-3798
 URL: https://issues.apache.org/jira/browse/SPARK-3798
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.2.0


 In some cases it is possible for the output of a generator to change, 
 resulting in a corrupted projection and thus incorrect data from a query that 
 uses a generator (e.g., LATERAL VIEW explode).
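
For context, a query of the affected shape, using a generator through LATERAL 
VIEW; {{hiveContext}} and the table and column names are placeholders:

{code}
// Illustrative only: a generator (explode) used via LATERAL VIEW.
hiveContext.sql(
  """SELECT name, tag
    |FROM users
    |LATERAL VIEW explode(tags) t AS tag""".stripMargin)
{code}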



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3811) More robust / standard Utils.deleteRecursively, Utils.createTempDir

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3811.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2670
[https://github.com/apache/spark/pull/2670]

 More robust / standard Utils.deleteRecursively, Utils.createTempDir
 ---

 Key: SPARK-3811
 URL: https://issues.apache.org/jira/browse/SPARK-3811
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sean Owen
Priority: Minor
 Fix For: 1.2.0


 I noticed a few issues with how temp directories are created and deleted:
 *Minor*
 * Guava's {{Files.createTempDir()}} plus {{File.deleteOnExit()}} is used in 
 many tests to make a temp dir, but {{Utils.createTempDir()}} seems to be the 
 standard Spark mechanism
 * Call to {{File.deleteOnExit()}} could be pushed into 
 {{Utils.createTempDir()}} as well, along with this replacement.
 * _I messed up the message in an exception in {{Utils}} in SPARK-3794; fixed 
 here_
 *Bit Less Minor*
 * {{Utils.deleteRecursively()}} fails immediately if any {{IOException}} 
 occurs, instead of trying to delete the remaining files and subdirectories. 
 I've observed this leave temp dirs behind. I suggest changing it to continue 
 in the face of an exception and, at the end, throw one of the possibly 
 several exceptions that occurred (see the sketch after this list).
 * {{Utils.createTempDir()}} will add a JVM shutdown hook every time the 
 method is called, even if the new dir is already covered by a registered 
 parent dir, since that check only happens inside the hook. However {{Utils}} 
 already manages a set of all dirs to delete on shutdown, called 
 {{shutdownDeletePaths}}. A single hook can be registered to delete all of 
 these on exit. This is how Tachyon temp paths are cleaned up in 
 {{TachyonBlockManager}}.
 I noticed a few other things that might be changed but wanted to ask first:
 * Shouldn't the set of dirs to delete be {{File}}, not just {{String}} paths?
 * {{Utils}} manages the set of {{TachyonFile}} that have been registered for 
 deletion, but the shutdown hook is managed in {{TachyonBlockManager}}. Should 
 this logic not live together, and not in {{Utils}}? it's more specific to 
 Tachyon, and looks a slight bit odd to import in such a generic place.
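
A minimal sketch of the behaviour proposed in the {{deleteRecursively}} bullet 
above (continue past failures and throw one collected exception at the end); 
this is not Spark's implementation, just an illustration:

{code}
import java.io.{File, IOException}

// Keep deleting siblings even if one entry fails, and surface one failure at the
// end instead of aborting on the first IOException.
def deleteRecursively(file: File): Unit = {
  var firstError: IOException = null
  if (file.isDirectory) {
    val children = Option(file.listFiles()).getOrElse(Array.empty[File])
    children.foreach { child =>
      try deleteRecursively(child)
      catch { case e: IOException => if (firstError == null) firstError = e }
    }
  }
  if (!file.delete() && file.exists() && firstError == null) {
    firstError = new IOException(s"Failed to delete: ${file.getAbsolutePath}")
  }
  if (firstError != null) throw firstError
}
{code}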



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3811) More robust / standard Utils.deleteRecursively, Utils.createTempDir

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3811:

Assignee: Sean Owen

 More robust / standard Utils.deleteRecursively, Utils.createTempDir
 ---

 Key: SPARK-3811
 URL: https://issues.apache.org/jira/browse/SPARK-3811
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.2.0


 I noticed a few issues with how temp directories are created and deleted:
 *Minor*
 * Guava's {{Files.createTempDir()}} plus {{File.deleteOnExit()}} is used in 
 many tests to make a temp dir, but {{Utils.createTempDir()}} seems to be the 
 standard Spark mechanism
 * Call to {{File.deleteOnExit()}} could be pushed into 
 {{Utils.createTempDir()}} as well, along with this replacement.
 * _I messed up the message in an exception in {{Utils}} in SPARK-3794; fixed 
 here_
 *Bit Less Minor*
 * {{Utils.deleteRecursively()}} fails immediately if any {{IOException}} 
 occurs, instead of trying to delete the remaining files and subdirectories. 
 I've observed this leave temp dirs behind. I suggest changing it to continue 
 in the face of an exception and, at the end, throw one of the possibly 
 several exceptions that occurred.
 * {{Utils.createTempDir()}} will add a JVM shutdown hook every time the 
 method is called, even if the new dir is already covered by a registered 
 parent dir, since that check only happens inside the hook. However {{Utils}} 
 already manages a set of all dirs to delete on shutdown, called 
 {{shutdownDeletePaths}}. A single hook can be registered to delete all of 
 these on exit. This is how Tachyon temp paths are cleaned up in 
 {{TachyonBlockManager}}.
 I noticed a few other things that might be changed but wanted to ask first:
 * Shouldn't the set of dirs to delete be {{File}}, not just {{String}} paths?
 * {{Utils}} manages the set of {{TachyonFile}} that have been registered for 
 deletion, but the shutdown hook is managed in {{TachyonBlockManager}}. Should 
 this logic not live together, and not in {{Utils}}? it's more specific to 
 Tachyon, and looks a slight bit odd to import in such a generic place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3824) Spark SQL should cache in MEMORY_AND_DISK by default

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3824.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2686
[https://github.com/apache/spark/pull/2686]

 Spark SQL should cache in MEMORY_AND_DISK by default
 

 Key: SPARK-3824
 URL: https://issues.apache.org/jira/browse/SPARK-3824
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Patrick Wendell
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.2.0


 Spark SQL currently uses MEMORY_ONLY as the default storage level. Due to the 
 use of column buffers, however, recomputing evicted blocks is far more 
 expensive than in Spark core. Especially since we are now more conservative 
 about caching blocks, and sometimes won't cache blocks we think might exceed 
 memory, it seems better to keep persisted blocks on disk by default.
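
For reference, the storage level proposed as the default corresponds to what a 
user can already request by hand; {{schemaRdd}} below is a placeholder:

{code}
import org.apache.spark.storage.StorageLevel

// Illustrative only: persist cached data so evicted partitions spill to disk
// instead of being recomputed from scratch.
schemaRdd.persist(StorageLevel.MEMORY_AND_DISK)
{code}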



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3834) Backticks not correctly handled in subquery aliases

2014-10-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3834.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2737
[https://github.com/apache/spark/pull/2737]

 Backticks not correctly handled in subquery aliases
 ---

 Key: SPARK-3834
 URL: https://issues.apache.org/jira/browse/SPARK-3834
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Ravindra Pesala
Priority: Blocker
 Fix For: 1.2.0


 [~ravi.pesala]  assigning to you since you fixed the last problem here.  Let 
 me know if you don't have time to work on this or if you have any questions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-10-09 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166168#comment-14166168
 ] 

Aaron Staple commented on SPARK-1503:
-

Hi, I’d like to try working on this ticket. If you’d like to assign it to me, I 
can write a short spec and then work on a PR.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3890) remove redundant spark.executor.memory in doc

2014-10-09 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-3890:
--

 Summary: remove redundant spark.executor.memory in doc
 Key: SPARK-3890
 URL: https://issues.apache.org/jira/browse/SPARK-3890
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: WangTaoTheTonic
Priority: Minor


Seems like there is a redundant spark.executor.memory config item in docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3795) Add scheduler hooks/heuristics for adding and removing executors

2014-10-09 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3795:
-
Affects Version/s: 1.1.0

 Add scheduler hooks/heuristics for adding and removing executors
 

 Key: SPARK-3795
 URL: https://issues.apache.org/jira/browse/SPARK-3795
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Patrick Wendell
Assignee: Andrew Or

 To support dynamic scaling of a Spark application, Spark's scheduler will 
 need to have hooks around explicitly decommissioning executors. We'll also 
 need basic heuristics governing when to start/stop executors based on load. 
 An initial goal is to keep this very simple.
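
A minimal sketch of the kind of hooks described above; the trait and method 
names are made up for illustration, not the API this work eventually added:

{code}
// Hypothetical shape of scheduler hooks for growing/shrinking an application's executors.
trait ExecutorAllocationHooks {
  def requestExecutors(numAdditionalExecutors: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
}
{code}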



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3891) Support Hive Percentile UDAF with array of percentile values

2014-10-09 Thread Anand Mohan Tumuluri (JIRA)
Anand Mohan Tumuluri created SPARK-3891:
---

 Summary: Support Hive Percentile UDAF with array of percentile 
values
 Key: SPARK-3891
 URL: https://issues.apache.org/jira/browse/SPARK-3891
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: Spark 1.2.0 trunk 
(ac302052870a650d56f2d3131c27755bb2960ad7) on
CDH 5.1.0
Centos 6.5
8x 2GHz, 24GB RAM
Reporter: Anand Mohan Tumuluri


Spark PR 2620 brings in the support of Hive percentile UDAF.
However Hive percentile and percentile_approx UDAFs also support returning an 
array of percentile values with the syntax
percentile(BIGINT col, array(p1 [, p2]...)) or 
percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B])

These queries are failing with the below error:

0: jdbc:hive2://dev-uuppala.sfohi.philips.com select name, 
percentile(turnaroundtime,array(0,0.25,0.5,0.75,1)) from exam group by name;

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
1 in stage 25.0 failed 4 times, most recent failure: Lost task 1.3 in stage 
25.0 (TID 305, Dev-uuppala.sfohi.philips.com): java.lang.ClassCastException: 
scala.collection.mutable.ArrayBuffer cannot be cast to [Ljava.lang.Object;

org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)

org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)

org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)

org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)
org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
org.apache.spark.scheduler.Task.run(Task.scala:56)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace: (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3892) Map type should have typeName

2014-10-09 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-3892:
--

 Summary: Map type should have typeName
 Key: SPARK-3892
 URL: https://issues.apache.org/jira/browse/SPARK-3892
 Project: Spark
  Issue Type: Bug
Reporter: Adrian Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3893) declare mutableMap/mutableSet explicitly

2014-10-09 Thread sjk (JIRA)
sjk created SPARK-3893:
--

 Summary: declare  mutableMap/mutableSet explicitly
 Key: SPARK-3893
 URL: https://issues.apache.org/jira/browse/SPARK-3893
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: sjk



{code:java}
  // current
  val workers = new HashSet[WorkerInfo]
  // sugguest
  val workers = new mutable.HashSet[WorkerInfo]
{code}

The other benefit is that it reminds us to consider whether an immutable 
collection could be used instead.

Most of the maps we use are mutable.
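
For completeness, the suggested style implies importing the mutable package once 
and keeping the prefix at every use site; {{WorkerInfo}} is replaced by 
{{String}} here so the snippet stands alone:

{code}
import scala.collection.mutable

// Mutability is visible wherever the collection is used.
val workers = new mutable.HashSet[String]()
workers += "worker-1"
{code}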




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


