[jira] [Created] (SPARK-3027) Tighten the visibility of fields in TaskContext and provide Java friendly callback API

2014-08-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3027:
--

 Summary: Tighten the visibility of fields in TaskContext and 
provide Java friendly callback API
 Key: SPARK-3027
 URL: https://issues.apache.org/jira/browse/SPARK-3027
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


TaskContext suffers from two problems:

1. The callback API accepts a Scala closure and is not Java friendly.

2. interrupted, completed are public vars, which means users can randomly 
change the control flow of tasks.
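
As a rough sketch of what a Java-friendly callback could look like (the trait name
TaskCompletionListener and the method names below are assumptions for illustration,
not necessarily what the fix introduces):

{code}
// Sketch only: a Java-friendly callback interface instead of a Scala closure.
// TaskCompletionListener, onTaskCompletion and addTaskCompletionListener are assumed names.
import org.apache.spark.TaskContext

trait TaskCompletionListener {
  def onTaskCompletion(context: TaskContext): Unit
}

// Java callers could then register a callback with an anonymous class:
//   context.addTaskCompletionListener(new TaskCompletionListener() {
//     public void onTaskCompletion(TaskContext context) { /* cleanup */ }
//   });
{code}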








[jira] [Commented] (SPARK-3027) Tighten visibility and provide Java friendly callback API in TaskContext

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096632#comment-14096632
 ] 

Apache Spark commented on SPARK-3027:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1938

 Tighten visibility and provide Java friendly callback API in TaskContext
 

 Key: SPARK-3027
 URL: https://issues.apache.org/jira/browse/SPARK-3027
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 TaskContext suffers from two problems:
 1. The callback API accepts a Scala closure and is not Java friendly.
 2. interrupted, completed are public vars, which means users can randomly 
 change the control flow of tasks.






[jira] [Updated] (SPARK-3027) Tighten visibility and provide Java friendly callback API in TaskContext

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3027:
---

Summary: Tighten visibility and provide Java friendly callback API in 
TaskContext  (was: Tighten the visibility of fields in TaskContext and provide 
Java friendly callback API)

 Tighten visibility and provide Java friendly callback API in TaskContext
 

 Key: SPARK-3027
 URL: https://issues.apache.org/jira/browse/SPARK-3027
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 TaskContext suffers from two problems:
 1. The callback API accepts a Scala closure and is not Java friendly.
 2. interrupted, completed are public vars, which means users can randomly 
 change the control flow of tasks.






[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2014-08-14 Thread Kostiantyn Kudriavtsev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096636#comment-14096636
 ] 

Kostiantyn Kudriavtsev commented on SPARK-2356:
---

Guoqiang, Spark does not work exclusively with Hadoop; it can live entirely outside 
of a Hadoop cluster/environment. So it's entirely possible that these two variables 
are not set.
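
For reference, the two variables referred to above are presumably the HADOOP_HOME
environment variable and the hadoop.home.dir system property, which Hadoop's Shell
class consults before looking for winutils.exe. A minimal workaround sketch for
Windows (the path below is a placeholder):

{code}
// Hedged workaround sketch: point hadoop.home.dir at a directory that contains
// bin\winutils.exe. The path is a placeholder and this only satisfies the lookup.
System.setProperty("hadoop.home.dir", "C:\\hadoop")
val sc = new org.apache.spark.SparkContext("local[2]", "winutils-workaround")
{code}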

 Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
 ---

 Key: SPARK-2356
 URL: https://issues.apache.org/jira/browse/SPARK-2356
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kostiantyn Kudriavtsev
Priority: Critical

 I'm trying to run some transformations on Spark; they work fine on a cluster 
 (YARN, Linux machines). However, when I try to run them on a local machine 
 (Windows 7) under a unit test, I get errors (I don't use Hadoop; I'm reading a file 
 from the local filesystem):
 {code}
 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
 Hadoop binaries.
   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
   at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
   at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
   at org.apache.hadoop.security.Groups.init(Groups.java:77)
   at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
   at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
   at 
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.init(SparkContext.scala:228)
   at org.apache.spark.SparkContext.init(SparkContext.scala:97)
 {code}
 This happens because the Hadoop config is initialized each time a Spark context is 
 created, regardless of whether Hadoop is required or not.
 I propose adding a special flag to indicate whether the Hadoop config is required 
 (or starting this configuration manually).






[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2014-08-14 Thread Tarek Nabil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096635#comment-14096635
 ] 

Tarek Nabil commented on SPARK-2356:


Yes, but the whole point is that you should not need Hadoop at all.

 Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
 ---

 Key: SPARK-2356
 URL: https://issues.apache.org/jira/browse/SPARK-2356
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kostiantyn Kudriavtsev
Priority: Critical

 I'm trying to run some transformations on Spark; they work fine on a cluster 
 (YARN, Linux machines). However, when I try to run them on a local machine 
 (Windows 7) under a unit test, I get errors (I don't use Hadoop; I'm reading a file 
 from the local filesystem):
 {code}
 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
 Hadoop binaries.
   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
   at org.apache.hadoop.util.Shell.clinit(Shell.java:326)
   at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76)
   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
   at org.apache.hadoop.security.Groups.init(Groups.java:77)
   at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
   at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
   at 
 org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
   at org.apache.spark.SparkContext.init(SparkContext.scala:228)
   at org.apache.spark.SparkContext.init(SparkContext.scala:97)
 {code}
 This happens because the Hadoop config is initialized each time a Spark context is 
 created, regardless of whether Hadoop is required or not.
 I propose adding a special flag to indicate whether the Hadoop config is required 
 (or starting this configuration manually).






[jira] [Created] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate

2014-08-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3028:
--

 Summary: sparkEventToJson should support 
SparkListenerExecutorMetricsUpdate
 Key: SPARK-3028
 URL: https://issues.apache.org/jira/browse/SPARK-3028
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Priority: Blocker


SparkListenerExecutorMetricsUpdate was added without updating 
org.apache.spark.util.JsonProtocol.sparkEventToJson.

This can crash the listener.








[jira] [Commented] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate

2014-08-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096645#comment-14096645
 ] 

Reynold Xin commented on SPARK-3028:


[~sandyr] [~sandyryza] Would you be able to do this quickly (tmr maybe)?

If not, somebody else should fix this ...

cc [~andrewor14]

 sparkEventToJson should support SparkListenerExecutorMetricsUpdate
 --

 Key: SPARK-3028
 URL: https://issues.apache.org/jira/browse/SPARK-3028
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Priority: Blocker

 SparkListenerExecutorMetricsUpdate was added without updating 
 org.apache.spark.util.JsonProtocol.sparkEventToJson.
 This can crash the listener.






[jira] [Updated] (SPARK-2456) Scheduler refactoring

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2456:
---

Component/s: Spark Core

 Scheduler refactoring
 -

 Key: SPARK-2456
 URL: https://issues.apache.org/jira/browse/SPARK-2456
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 This is an umbrella ticket to track scheduler refactoring. We want to clearly 
 define semantics and responsibilities of each component, and define explicit 
 public interfaces for them so it is easier to understand and to contribute 
 (also less buggy).






[jira] [Created] (SPARK-3029) Disable local execution of Spark jobs by default

2014-08-14 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3029:
-

 Summary: Disable local execution of Spark jobs by default
 Key: SPARK-3029
 URL: https://issues.apache.org/jira/browse/SPARK-3029
 Project: Spark
  Issue Type: Improvement
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Currently, local execution of Spark jobs is only used by take(), and it can be 
problematic as it can load a significant amount of data onto the driver. The 
worst case scenarios occur if the RDD is cached (guaranteed to load whole 
partition), has very large elements, or the partition is just large and we 
apply a filter with high selectivity or computational overhead.

Additionally, jobs that run locally in this manner do not show up in the web UI, 
which makes it harder to track them or understand what is occurring.
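
For illustration, a hypothetical job where take() can pull far more data onto the
driver than the single element it returns (sizes are made up; sc is assumed to be an
existing SparkContext):

{code}
// Hypothetical illustration: take(1) on a cached RDD with large elements may run
// locally on the driver and materialize an entire partition there.
val big = sc.parallelize(1 to 1000000, 4)
  .map(i => new Array[Byte](64 * 1024))  // ~64 KB per element
  .cache()
big.take(1)  // local execution can load the whole first partition onto the driver
{code}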






[jira] [Commented] (SPARK-3019) Pluggable block transfer (data plane communication) interface

2014-08-14 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096654#comment-14096654
 ] 

Hari Shreedharan commented on SPARK-3019:
-

Why specifically MapR FS? You could use HDFS for this as well right?

 Pluggable block transfer (data plane communication) interface
 -

 Key: SPARK-3019
 URL: https://issues.apache.org/jira/browse/SPARK-3019
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
 Attachments: PluggableBlockTransferServiceProposalforSpark - draft 
 1.pdf


 The attached design doc proposes a standard interface for block transferring, 
 which will make future engineering of this functionality easier, allowing the 
 Spark community to provide alternative implementations.
 Block transferring is a critical function in Spark. All of the following 
 depend on it:
 * shuffle
 * torrent broadcast
 * block replication in BlockManager
 * remote block reads for tasks scheduled without locality
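
As a rough idea of the shape such a pluggable interface could take (the trait and
method names here are illustrative assumptions, not the API from the attached design
doc):

{code}
// Illustrative sketch only -- not the interface proposed in the design doc.
import java.nio.ByteBuffer

trait BlockFetchListener {
  def onBlockFetchSuccess(blockId: String, data: ByteBuffer): Unit
  def onBlockFetchFailure(blockId: String, cause: Throwable): Unit
}

trait BlockTransferService {
  def fetchBlocks(host: String, port: Int, blockIds: Seq[String],
                  listener: BlockFetchListener): Unit
  def uploadBlock(host: String, port: Int, blockId: String, data: ByteBuffer): Unit
  def close(): Unit
}
{code}

Shuffle, torrent broadcast and block replication would then all go through whichever
implementation is plugged in.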






[jira] [Created] (SPARK-3030) reuse python worker

2014-08-14 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3030:
-

 Summary: reuse python worker
 Key: SPARK-3030
 URL: https://issues.apache.org/jira/browse/SPARK-3030
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu


Currently, PySpark forks a Python worker for each task; it would be better if we 
could reuse the worker for later tasks.

This would be very useful for large datasets with big broadcast variables, since the 
broadcast would not need to be sent to the worker again and again. It would also 
reduce the overhead of launching a task.






[jira] [Commented] (SPARK-3019) Pluggable block transfer (data plane communication) interface

2014-08-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096658#comment-14096658
 ] 

Reynold Xin commented on SPARK-3019:


Possibly, although I think MapR FS is more optimized for this kind of workload, 
as they have done that previously for MR shuffle. Don't quote me on this one. 

 Pluggable block transfer (data plane communication) interface
 -

 Key: SPARK-3019
 URL: https://issues.apache.org/jira/browse/SPARK-3019
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
 Attachments: PluggableBlockTransferServiceProposalforSpark - draft 
 1.pdf


 The attached design doc proposes a standard interface for block transferring, 
 which will make future engineering of this functionality easier, allowing the 
 Spark community to provide alternative implementations.
 Block transferring is a critical function in Spark. All of the following 
 depend on it:
 * shuffle
 * torrent broadcast
 * block replication in BlockManager
 * remote block reads for tasks scheduled without locality






[jira] [Resolved] (SPARK-1170) Add histogram() to PySpark

2014-08-14 Thread Chandan Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandan Kumar resolved SPARK-1170.
--

Resolution: Duplicate

Davies is working on this.

 Add histogram() to PySpark
 --

 Key: SPARK-1170
 URL: https://issues.apache.org/jira/browse/SPARK-1170
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Matei Zaharia
Assignee: Josh Rosen
Priority: Minor








[jira] [Commented] (SPARK-3031) Create JsonSerializable and move JSON serialization from JsonProtocol into each class

2014-08-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096664#comment-14096664
 ] 

Reynold Xin commented on SPARK-3031:


cc [~andrewor14] [~adav]

 Create JsonSerializable and move JSON serialization from JsonProtocol into 
 each class
 -

 Key: SPARK-3031
 URL: https://issues.apache.org/jira/browse/SPARK-3031
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin

 It is really, really weird that we have a single file JsonProtocol that 
 handles the JSON serialization/deserialization for a bunch of classes. This 
 is very error prone because it is easy to add a new field to a class, and 
 forget to update the JSON part of it. Or worse, add a new event class and 
 forget to add one, as evidenced by 
 https://issues.apache.org/jira/browse/SPARK-3028






[jira] [Updated] (SPARK-3029) Disable local execution of Spark jobs by default

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3029:
---

Priority: Blocker  (was: Major)
Target Version/s: 1.1.0

 Disable local execution of Spark jobs by default
 

 Key: SPARK-3029
 URL: https://issues.apache.org/jira/browse/SPARK-3029
 Project: Spark
  Issue Type: Improvement
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker

 Currently, local execution of Spark jobs is only used by take(), and it can 
 be problematic as it can load a significant amount of data onto the driver. 
 The worst case scenarios occur if the RDD is cached (guaranteed to load whole 
 partition), has very large elements, or the partition is just large and we 
 apply a filter with high selectivity or computational overhead.
 Additionally, jobs that run locally in this manner do not show up in the web UI, 
 which makes it harder to track them or understand what is occurring.






[jira] [Updated] (SPARK-3031) Create JsonSerializable and move JSON serialization from JsonProtocol into each class

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3031:
---

Component/s: Spark Core

 Create JsonSerializable and move JSON serialization from JsonProtocol into 
 each class
 -

 Key: SPARK-3031
 URL: https://issues.apache.org/jira/browse/SPARK-3031
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin

 It is really, really weird that we have a single file JsonProtocol that 
 handles the JSON serialization/deserialization for a bunch of classes. This 
 is very error prone because it is easy to add a new field to a class, and 
 forget to update the JSON part of it. Or worse, add a new event class and 
 forget to add one, as evidenced by 
 https://issues.apache.org/jira/browse/SPARK-3028






[jira] [Created] (SPARK-3031) Create JsonSerializable and move JSON serialization from JsonProtocol into each class

2014-08-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3031:
--

 Summary: Create JsonSerializable and move JSON serialization from 
JsonProtocol into each class
 Key: SPARK-3031
 URL: https://issues.apache.org/jira/browse/SPARK-3031
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin


It is really, really weird that we have a single file JsonProtocol that handles 
the JSON serialization/deserialization for a bunch of classes. This is very 
error prone because it is easy to add a new field to a class, and forget to 
update the JSON part of it. Or worse, add a new event class and forget to add 
one, as evidenced by https://issues.apache.org/jira/browse/SPARK-3028
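
One possible shape for such a trait, sketched under the assumption that json4s stays
the JSON representation (the trait and method names are illustrative, not the final
proposal):

{code}
// Sketch only: each event owns its JSON rendering instead of JsonProtocol doing it.
import org.json4s.JsonAST.JValue

trait JsonSerializable {
  def toJson: JValue
}

// A new event class that mixes in JsonSerializable must then provide its JSON form:
//   case class MyNewEvent(...) extends SparkListenerEvent with JsonSerializable {
//     override def toJson: JValue = ...
//   }
{code}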






[jira] [Resolved] (SPARK-2995) Allow to set storage level for intermediate RDDs in ALS

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2995.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1913
[https://github.com/apache/spark/pull/1913]

 Allow to set storage level for intermediate RDDs in ALS
 ---

 Key: SPARK-2995
 URL: https://issues.apache.org/jira/browse/SPARK-2995
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.1.0


 As mentioned in [SPARK-2465], using MEMORY_AND_DISK_SER together with 
 spark.rdd.compress=true can help reduce the space requirement by a lot, at 
 the cost of speed. It might be useful to add this option so people can run 
 ALS on much bigger datasets.
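
A hedged usage sketch of what this could look like once exposed (the setter name
setIntermediateRDDStorageLevel is an assumption here):

{code}
// Sketch only: persist ALS intermediate RDDs serialized, spilling to disk if needed.
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.storage.StorageLevel

val als = new ALS()
  .setRank(20)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER)  // assumed setter
// combined with spark.rdd.compress=true in the SparkConf to trade speed for space
{code}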






[jira] [Updated] (SPARK-2893) Should not swallow exception when cannot find custom Kryo registrator

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2893:
---

Priority: Blocker  (was: Major)
Target Version/s: 1.1.0

 Should not swallow exception when cannot find custom Kryo registrator
 -

 Key: SPARK-2893
 URL: https://issues.apache.org/jira/browse/SPARK-2893
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis
Priority: Blocker

 If for any reason a custom Kryo registrator cannot be found, the task (and 
 application) should fail, because otherwise problems could occur later during 
 data serialisation / deserialisation.  See SPARK-2878 for an example.
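
A minimal sketch of the fail-fast behaviour being asked for (not the actual
KryoSerializer code; userRegistrator, classLoader and kryo are assumed to be in
scope):

{code}
// Sketch only: surface the failure instead of swallowing it.
try {
  val reg = Class.forName(userRegistrator, true, classLoader)
    .newInstance().asInstanceOf[org.apache.spark.serializer.KryoRegistrator]
  reg.registerClasses(kryo)
} catch {
  case e: Exception =>
    throw new org.apache.spark.SparkException(
      s"Failed to load custom Kryo registrator: $userRegistrator", e)
}
{code}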






[jira] [Updated] (SPARK-2893) Should not swallow exception when cannot find custom Kryo registrator

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2893:
---

Assignee: Graham Dennis

 Should not swallow exception when cannot find custom Kryo registrator
 -

 Key: SPARK-2893
 URL: https://issues.apache.org/jira/browse/SPARK-2893
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis
Assignee: Graham Dennis
Priority: Blocker

 If for any reason a custom Kryo registrator cannot be found, the task (and 
 application) should fail, because otherwise problems could occur later during 
 data serialisation / deserialisation.  See SPARK-2878 for an example.






[jira] [Updated] (SPARK-2893) Should not swallow exception when cannot find custom Kryo registrator

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2893:
---

Target Version/s: 1.1.0, 1.0.3  (was: 1.1.0)

 Should not swallow exception when cannot find custom Kryo registrator
 -

 Key: SPARK-2893
 URL: https://issues.apache.org/jira/browse/SPARK-2893
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis
Priority: Blocker

 If for any reason a custom Kryo registrator cannot be found, the task (and 
 application) should fail, because otherwise problems could occur later during 
 data serialisation / deserialisation.  See SPARK-2878 for an example.






[jira] [Updated] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2878:
---

Assignee: Graham Dennis

 Inconsistent Kryo serialisation with custom Kryo Registrator
 

 Key: SPARK-2878
 URL: https://issues.apache.org/jira/browse/SPARK-2878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
 Environment: Linux RedHat EL 6, 4-node Spark cluster.
Reporter: Graham Dennis
Assignee: Graham Dennis

 The custom Kryo Registrator (a class with the 
 org.apache.spark.serializer.KryoRegistrator trait) is not used with every 
 Kryo instance created, and this causes inconsistent serialisation and 
 deserialisation.
 The Kryo Registrator is sometimes not used because of a ClassNotFound 
 exception that only occurs if it *isn't* the Worker thread (of an Executor) 
 that tries to create the KryoRegistrator.
 A complete description of the problem and a project reproducing the problem 
 can be found at https://github.com/GrahamDennis/spark-kryo-serialisation
 I have currently only tested this with Spark 1.0.0, but will try to test 
 against 1.0.2.






[jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096608#comment-14096608
 ] 

Saisai Shao edited comment on SPARK-2926 at 8/14/14 7:12 AM:
-

Hi Matei, 

I just uploaded a Spark shuffle performance test report. In this report, I chose 3 
different workloads (sort-by-key, aggregate-by-key and group-by-key) in SparkPerf to 
test the performance of the 3 current shuffle implementations: hash-based shuffle; 
sort-based shuffle with HashShuffleReader; sort-based shuffle with the sort-merge 
shuffle reader (our prototype). Generally, for sort-by-key our prototype gains more 
benefit than the other two implementations, while for the other two workloads the 
performance is almost the same.

Would you mind taking a look at it, any comment would be greatly appreciated, 
thanks a lot.


was (Author: jerryshao):
Hi Matei, 

I just uploaded a Spark shuffle performance test report. In this report, I chose 3 
different workloads (sort-by-key, aggregate-by-key and group-by-key) in SparkPerf to 
test the performance of the 3 current shuffle implementations: hash-based shuffle; 
sort-based shuffle with HashShuffleReader; sort-based shuffle with the sort-merge 
shuffle reader (our prototype).

Would you mind taking a look at it, any comment would be greatly appreciated, 
thanks a lot.

 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
 --

 Key: SPARK-2926
 URL: https://issues.apache.org/jira/browse/SPARK-2926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao
 Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf


 Currently Spark has already integrated sort-based shuffle write, which greatly 
 improves IO performance and reduces memory consumption when the number of reducers 
 is very large. But the reducer side still adopts the hash-based shuffle reader 
 implementation, which neglects the ordering attributes of the map output data in 
 some situations.
 Here we propose an MR-style, sort-merge-like shuffle reader for sort-based shuffle 
 to further improve its performance.
 Work-in-progress code and a performance test report will be posted later when some 
 unit test bugs are fixed.
 Any comments would be greatly appreciated. 
 Thanks a lot.
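
For intuition, the reduce side of such a reader is essentially a k-way merge over
per-map-output streams that are already sorted. A toy sketch, not the prototype's
code, with types simplified to (K, V) pairs:

{code}
// Toy k-way merge over pre-sorted iterators, for intuition only.
def mergeSorted[K, V](streams: Seq[Iterator[(K, V)]])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // Min-heap keyed on the head key of each buffered stream.
  val heap = new scala.collection.mutable.PriorityQueue[BufferedIterator[(K, V)]]()(
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1)(ord.reverse))
  streams.map(_.buffered).filter(_.hasNext).foreach(it => heap.enqueue(it))
  new Iterator[(K, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (K, V) = {
      val stream = heap.dequeue()
      val kv = stream.next()
      if (stream.hasNext) heap.enqueue(stream)  // re-insert with its new head key
      kv
    }
  }
}
{code}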






[jira] [Updated] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2878:
---

Target Version/s: 1.1.0, 1.0.3

 Inconsistent Kryo serialisation with custom Kryo Registrator
 

 Key: SPARK-2878
 URL: https://issues.apache.org/jira/browse/SPARK-2878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
 Environment: Linux RedHat EL 6, 4-node Spark cluster.
Reporter: Graham Dennis
Assignee: Graham Dennis

 The custom Kryo Registrator (a class with the 
 org.apache.spark.serializer.KryoRegistrator trait) is not used with every 
 Kryo instance created, and this causes inconsistent serialisation and 
 deserialisation.
 The Kryo Registrator is sometimes not used because of a ClassNotFound 
 exception that only occurs if it *isn't* the Worker thread (of an Executor) 
 that tries to create the KryoRegistrator.
 A complete description of the problem and a project reproducing the problem 
 can be found at https://github.com/GrahamDennis/spark-kryo-serialisation
 I have currently only tested this with Spark 1.0.0, but will try to test 
 against 1.0.2.






[jira] [Commented] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate

2014-08-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096689#comment-14096689
 ] 

Patrick Wendell commented on SPARK-3028:


I think we intentionally do not ever serialize an update event as JSON.

However, I think we should be more explicit about this in two ways.

1. In EventLoggingListener we should explicitly put a no-op implementation of 
onExecutorMetricsUpdate with a comment - I actually thought one version of the PR 
had this, but maybe I'm misremembering.

2. We should also include the SparkListenerExecutorMetricsUpdate in the match 
block and just have a no-op there also with a comment.
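
A rough standalone sketch of the no-op idea, assuming the SparkListener trait exposes
onExecutorMetricsUpdate (the class name here is hypothetical):

{code}
// Sketch only: explicitly ignore executor metrics updates rather than logging them.
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

class QuietEventLogger extends SparkListener {
  // Intentionally a no-op: metrics updates are never written to the event log as JSON.
  override def onExecutorMetricsUpdate(update: SparkListenerExecutorMetricsUpdate): Unit = { }
}
{code}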


 sparkEventToJson should support SparkListenerExecutorMetricsUpdate
 --

 Key: SPARK-3028
 URL: https://issues.apache.org/jira/browse/SPARK-3028
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Priority: Blocker

 SparkListenerExecutorMetricsUpdate was added without updating 
 org.apache.spark.util.JsonProtocol.sparkEventToJson.
 This can crash the listener.






[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096688#comment-14096688
 ] 

Saisai Shao commented on SPARK-2926:


I think this prototype can easily offer the functionality SPARK-2978 needed.

 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
 --

 Key: SPARK-2926
 URL: https://issues.apache.org/jira/browse/SPARK-2926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao
 Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf


 Currently Spark has already integrated sort-based shuffle write, which greatly 
 improves IO performance and reduces memory consumption when the number of reducers 
 is very large. But the reducer side still adopts the hash-based shuffle reader 
 implementation, which neglects the ordering attributes of the map output data in 
 some situations.
 Here we propose an MR-style, sort-merge-like shuffle reader for sort-based shuffle 
 to further improve its performance.
 Work-in-progress code and a performance test report will be posted later when some 
 unit test bugs are fixed.
 Any comments would be greatly appreciated. 
 Thanks a lot.






[jira] [Comment Edited] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate

2014-08-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096689#comment-14096689
 ] 

Patrick Wendell edited comment on SPARK-3028 at 8/14/14 7:20 AM:
-

I think we currently do not intend to ever serialize an update event as JSON.

However, I think we should be more explicit about this in two ways.

1. In EventLoggingListener we should explicitly put a no-op implementation of 
onExecutorMetricsUpdate with a comment - I actually thought one version of the PR 
had this, but maybe I'm misremembering.

2. We should also include the SparkListenerExecutorMetricsUpdate in the match 
block and just have a no-op there also with a comment.



was (Author: pwendell):
I think we explicitly do not intend to ever serialize an update event as JSON.

However, I think we should be more explicit about this in two ways.

1. In EventLoggingListener we should explicitly put a no-op implementation of 
onExecutorMetricsUpdate with a comment - I actually thought one version of the PR 
had this, but maybe I'm misremembering.

2. We should also include the SparkListenerExecutorMetricsUpdate in the match 
block and just have a no-op there also with a comment.


 sparkEventToJson should support SparkListenerExecutorMetricsUpdate
 --

 Key: SPARK-3028
 URL: https://issues.apache.org/jira/browse/SPARK-3028
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Priority: Blocker

 SparkListenerExecutorMetricsUpdate was added without updating 
 org.apache.spark.util.JsonProtocol.sparkEventToJson.
 This can crash the listener.






[jira] [Commented] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate

2014-08-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096696#comment-14096696
 ] 

Reynold Xin commented on SPARK-3028:


Ok, in that case your suggestion makes sense.

 sparkEventToJson should support SparkListenerExecutorMetricsUpdate
 --

 Key: SPARK-3028
 URL: https://issues.apache.org/jira/browse/SPARK-3028
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Priority: Blocker

 SparkListenerExecutorMetricsUpdate was added without updating 
 org.apache.spark.util.JsonProtocol.sparkEventToJson.
 This can crash the listener.






[jira] [Issue Comment Deleted] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-08-14 Thread Xu Zhongxing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xu Zhongxing updated SPARK-3005:


Comment: was deleted

(was: A related question: why do fine-grained mode and coarse-grained mode perform 
differently? Neither of them implements the killTask() method.)

 Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
 MesosSchedulerBackend.killTask()
 ---

 Key: SPARK-3005
 URL: https://issues.apache.org/jira/browse/SPARK-3005
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
Reporter: Xu Zhongxing
 Attachments: SPARK-3005_1.diff


 I am using Spark, Mesos, spark-cassandra-connector to do some work on a 
 cassandra cluster.
 While the job was running, I killed the Cassandra daemon to simulate some failure 
 cases. This results in task failures.
 If I run the job in Mesos coarse-grained mode, the Spark driver program throws an 
 exception and shuts down cleanly.
 But when I run the job in Mesos fine-grained mode, the spark driver program 
 hangs.
 The spark log is: 
 {code}
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
 Logging.scala (line 58) Cancelling stage 1
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
 Logging.scala (line 79) Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 

[jira] [Updated] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2878:
---

Priority: Critical  (was: Major)

 Inconsistent Kryo serialisation with custom Kryo Registrator
 

 Key: SPARK-2878
 URL: https://issues.apache.org/jira/browse/SPARK-2878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.2
 Environment: Linux RedHat EL 6, 4-node Spark cluster.
Reporter: Graham Dennis
Assignee: Graham Dennis
Priority: Critical

 The custom Kryo Registrator (a class with the 
 org.apache.spark.serializer.KryoRegistrator trait) is not used with every 
 Kryo instance created, and this causes inconsistent serialisation and 
 deserialisation.
 The Kryo Registrator is sometimes not used because of a ClassNotFound 
 exception that only occurs if it *isn't* the Worker thread (of an Executor) 
 that tries to create the KryoRegistrator.
 A complete description of the problem and a project reproducing the problem 
 can be found at https://github.com/GrahamDennis/spark-kryo-serialisation
 I have currently only tested this with Spark 1.0.0, but will try to test 
 against 1.0.2.






[jira] [Comment Edited] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-08-14 Thread Xu Zhongxing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095288#comment-14095288
 ] 

Xu Zhongxing edited comment on SPARK-3005 at 8/14/14 7:36 AM:
--

Some additional driver logs during the spark driver hang:
{code}
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,908 
Logging.scala (line 66) Checking for newly runnable parent stages
 INFO [Result resolver thread-1] 2014-08-13 15:58:15,908 Logging.scala (line 
58) Removed TaskSet 1.0, whose tasks have all completed, from pool 
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 
Logging.scala (line 66) running: Set(Stage 1, Stage 2)
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 
Logging.scala (line 66) waiting: Set(Stage 0)
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 
Logging.scala (line 66) failed: Set()
DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 
Logging.scala (line 62) submitStage(Stage 0)
DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 
Logging.scala (line 62) missing: List(Stage 1, Stage 2)
DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 
Logging.scala (line 62) submitStage(Stage 1)
DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 
Logging.scala (line 62) submitStage(Stage 2)
TRACE [spark-akka.actor.default-dispatcher-3] 2014-08-13 15:58:56,643 
Logging.scala (line 66) Checking for hosts with no recent heart beats in 
BlockManagerMaster.
TRACE [spark-akka.actor.default-dispatcher-6] 2014-08-13 15:59:56,653 
Logging.scala (line 66) Checking for hosts with no recent heart beats in 
BlockManagerMaster.
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 16:00:56,652 
Logging.scala (line 66) Checking for hosts with no recent heart beats in 
BlockManagerMaster.
{code}


was (Author: xuzhongxing):
Some additional logs during the spark driver hang:

TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,908 
Logging.scala (line 66) Checking for newly runnable parent stages
 INFO [Result resolver thread-1] 2014-08-13 15:58:15,908 Logging.scala (line 
58) Removed TaskSet 1.0, whose tasks have all completed, from pool 
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 
Logging.scala (line 66) running: Set(Stage 1, Stage 2)
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 
Logging.scala (line 66) waiting: Set(Stage 0)
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 
Logging.scala (line 66) failed: Set()
DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 
Logging.scala (line 62) submitStage(Stage 0)
DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 
Logging.scala (line 62) missing: List(Stage 1, Stage 2)
DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 
Logging.scala (line 62) submitStage(Stage 1)
DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 
Logging.scala (line 62) submitStage(Stage 2)
TRACE [spark-akka.actor.default-dispatcher-3] 2014-08-13 15:58:56,643 
Logging.scala (line 66) Checking for hosts with no recent heart beats in 
BlockManagerMaster.
TRACE [spark-akka.actor.default-dispatcher-6] 2014-08-13 15:59:56,653 
Logging.scala (line 66) Checking for hosts with no recent heart beats in 
BlockManagerMaster.
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 16:00:56,652 
Logging.scala (line 66) Checking for hosts with no recent heart beats in 
BlockManagerMaster.

 Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
 MesosSchedulerBackend.killTask()
 ---

 Key: SPARK-3005
 URL: https://issues.apache.org/jira/browse/SPARK-3005
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
Reporter: Xu Zhongxing
 Attachments: SPARK-3005_1.diff


 I am using Spark, Mesos, spark-cassandra-connector to do some work on a 
 cassandra cluster.
 While the job was running, I killed the Cassandra daemon to simulate some failure 
 cases. This results in task failures.
 If I run the job in Mesos coarse-grained mode, the Spark driver program throws an 
 exception and shuts down cleanly.
 But when I run the job in Mesos fine-grained mode, the spark driver program 
 hangs.
 The spark log is: 
 {code}
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
 Logging.scala (line 58) Cancelling stage 1
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
 Logging.scala (line 79) Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   

[jira] [Comment Edited] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-08-14 Thread Xu Zhongxing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096539#comment-14096539
 ] 

Xu Zhongxing edited comment on SPARK-3005 at 8/14/14 7:37 AM:
--

Could adding an empty killTask method to  MesosSchedulerBackend fix this 
problem?
{code:title=MesosSchedulerBackend.scala}
override def killTask(taskId: Long, executorId: String, interruptThread: 
Boolean) {}
{code}
This works for my tests.


was (Author: xuzhongxing):
Could adding an empty killTask method to  MesosSchedulerBackend fix this 
problem?
{code}
override def killTask(taskId: Long, executorId: String, interruptThread: 
Boolean) {}
{code}
This works for my tests.

 Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
 MesosSchedulerBackend.killTask()
 ---

 Key: SPARK-3005
 URL: https://issues.apache.org/jira/browse/SPARK-3005
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
Reporter: Xu Zhongxing
 Attachments: SPARK-3005_1.diff


 I am using Spark, Mesos, spark-cassandra-connector to do some work on a 
 cassandra cluster.
 While the job was running, I killed the Cassandra daemon to simulate some failure 
 cases. This results in task failures.
 If I run the job in Mesos coarse-grained mode, the Spark driver program throws an 
 exception and shuts down cleanly.
 But when I run the job in Mesos fine-grained mode, the spark driver program 
 hangs.
 The spark log is: 
 {code}
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
 Logging.scala (line 58) Cancelling stage 1
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
 Logging.scala (line 79) Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at 

[jira] [Created] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort

2014-08-14 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-3032:
--

 Summary: Potential bug when running sort-based shuffle with 
sorting using TimSort
 Key: SPARK-3032
 URL: https://issues.apache.org/jira/browse/SPARK-3032
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao


When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, with 
(String, String) as the key/value data type, I always hit this issue:

{noformat}
java.lang.IllegalArgumentException: Comparison method violates its general 
contract!
at 
org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755)
at 
org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493)
at 
org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420)
at 
org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294)
at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128)
at 
org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83)
at 
org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323)
at 
org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271)
at 
org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
{noformat}

It seems the current partitionKeyComparator, which uses the hashcode of String as 
the key comparator, breaks the sorting contract.

I also tested using Int as the key type, and that passes, since the hashcode of an 
Int is itself. So I think partitionDiff + the hashcode of String may potentially 
break the sorting contract.
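
To make the contract violation concrete: a comparator built on raw hashcode
arithmetic can overflow Int and flip sign, which is the kind of inconsistency
TimSort's check reports. A generic illustration, not Spark's actual
partitionKeyComparator:

{code}
// Generic illustration only -- not Spark's partitionKeyComparator.
// Subtracting hashcodes can overflow Int and invert the comparison result.
val badOrdering = new Ordering[String] {
  def compare(a: String, b: String): Int = a.hashCode - b.hashCode  // overflow-prone
}

// Overflow-safe comparison of the same quantities:
val safeOrdering = new Ordering[String] {
  def compare(a: String, b: String): Int = java.lang.Integer.compare(a.hashCode, b.hashCode)
}
{code}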






[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096725#comment-14096725
 ] 

Apache Spark commented on SPARK-3005:
-

User 'xuzhongxing' has created a pull request for this issue:
https://github.com/apache/spark/pull/1940

 Spark with Mesos fine-grained mode throws UnsupportedOperationException in 
 MesosSchedulerBackend.killTask()
 ---

 Key: SPARK-3005
 URL: https://issues.apache.org/jira/browse/SPARK-3005
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector
Reporter: Xu Zhongxing
 Attachments: SPARK-3005_1.diff


 I am using Spark, Mesos, and spark-cassandra-connector to do some work on a 
 Cassandra cluster.
 While the job was running, I killed the Cassandra daemon to simulate some 
 failure cases. This results in task failures.
 If I run the job in Mesos coarse-grained mode, the Spark driver program 
 throws an exception and shuts down cleanly.
 But when I run the job in Mesos fine-grained mode, the Spark driver program 
 hangs.
 The spark log is: 
 {code}
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 
 Logging.scala (line 58) Cancelling stage 1
  INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 
 Logging.scala (line 79) Could not cancel tasks for stage 1
 java.lang.UnsupportedOperationException
   at 
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
   at 
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061)
   at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 

[jira] [Resolved] (SPARK-3029) Disable local execution of Spark jobs by default

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3029.


   Resolution: Fixed
Fix Version/s: 1.1.0

 Disable local execution of Spark jobs by default
 

 Key: SPARK-3029
 URL: https://issues.apache.org/jira/browse/SPARK-3029
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker
 Fix For: 1.1.0


 Currently, local execution of Spark jobs is only used by take(), and it can 
 be problematic as it can load a significant amount of data onto the driver. 
 The worst case scenarios occur if the RDD is cached (guaranteed to load whole 
 partition), has very large elements, or the partition is just large and we 
 apply a filter with high selectivity or computational overhead.
 Additionally, jobs that run locally in this manner do not show up in the web 
 UI, and are thus harder to track and understand.
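
For readers who still want the old behaviour after this change, a minimal sketch of 
opting back in, assuming the switch is exposed as the spark.localExecution.enabled 
property (the property name is an assumption based on this issue, not verified 
against the released configuration docs):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local execution is now off by default and guarded by
// "spark.localExecution.enabled"; setting it to true restores the old
// driver-side evaluation of take() on the first partition.
val conf = new SparkConf()
  .setAppName("local-execution-opt-in")
  .setMaster("local[2]")
  .set("spark.localExecution.enabled", "true")
val sc = new SparkContext(conf)
val firstFew = sc.parallelize(1 to 1000000).take(5)
{code}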



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3033) java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal

2014-08-14 Thread pengyanhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengyanhong updated SPARK-3033:
---

Description: 
Running a complex HiveQL via yarn-cluster produced the error below:
{quote}
14/08/14 15:05:24 WARN 
org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to 
java.lang.ClassCastException
java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
org.apache.hadoop.hive.common.type.HiveDecimal
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51)
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022)
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306)
at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179)
at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82)
at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276)
at 
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
at 
org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309)
at 
org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
{quote}


  was:
14/08/14 15:05:24 WARN 
org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to 
java.lang.ClassCastException
java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
org.apache.hadoop.hive.common.type.HiveDecimal
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51)
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022)
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306)
at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179)
at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82)
at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276)
at 
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
at 
org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62)
at 
org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309)
at 
org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
at 

[jira] [Created] (SPARK-3034) java.sql.Date cannot be cast to java.sql.Timestamp

2014-08-14 Thread pengyanhong (JIRA)
pengyanhong created SPARK-3034:
--

 Summary: java.sql.Date cannot be cast to java.sql.Timestamp
 Key: SPARK-3034
 URL: https://issues.apache.org/jira/browse/SPARK-3034
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.2
Reporter: pengyanhong
Priority: Blocker


Exception in thread Thread-2 java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 
on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date 
cannot be cast to java.sql.Timestamp

org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33)

org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Updated] (SPARK-3034) java.sql.Date cannot be cast to java.sql.Timestamp

2014-08-14 Thread pengyanhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengyanhong updated SPARK-3034:
---

Description: 
Running a simple HiveQL via yarn-cluster produced the error below:
{quote}
Exception in thread Thread-2 java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 
on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date 
cannot be cast to java.sql.Timestamp

org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33)

org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

  was:
Exception in thread Thread-2 java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at 

[jira] [Updated] (SPARK-3034) java.sql.Date cannot be cast to java.sql.Timestamp

2014-08-14 Thread pengyanhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengyanhong updated SPARK-3034:
---

Description: 
Running a simple HiveQL via yarn-cluster produced the error below:
{quote}
Exception in thread Thread-2 java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 
on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date 
cannot be cast to java.sql.Timestamp

org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33)

org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439)

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)

org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{quote}

The above error is thrown in the stage that inserts the result data into a Hive 
table which has a field of timestamp data type.
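
For context, java.sql.Date and java.sql.Timestamp are sibling subclasses of 
java.util.Date, so the cast can never succeed; a minimal sketch of the kind of 
conversion that reconciles the two (illustrative only, not the actual 
InsertIntoHiveTable code):

{code}
import java.sql.{Date, Timestamp}

// The object inspector wants a Timestamp but the row holds a Date; going
// through the epoch milliseconds avoids the ClassCastException.
def toTimestamp(value: Any): Timestamp = value match {
  case ts: Timestamp => ts
  case d: Date       => new Timestamp(d.getTime)
  case other         => throw new IllegalArgumentException(s"Unexpected value: $other")
}
{code}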


  was:
run a simple HiveQL via yarn-cluster, got error as below:
{quote}
Exception in thread Thread-2 java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

[jira] [Created] (SPARK-3035) Wrong example with SparkContext.addFile

2014-08-14 Thread Daehan Kim (JIRA)
Daehan Kim created SPARK-3035:
-

 Summary: Wrong example with SparkContext.addFile
 Key: SPARK-3035
 URL: https://issues.apache.org/jira/browse/SPARK-3035
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 1.0.2
Reporter: Daehan Kim
Priority: Trivial
 Fix For: 1.0.2


{code:title=context.py}
def addFile(self, path):
    """
    ...
    >>> from pyspark import SparkFiles
    >>> path = os.path.join(tempdir, "test.txt")
    >>> with open(path, "w") as testFile:
    ...    testFile.write("100")
    >>> sc.addFile(path)
    >>> def func(iterator):
    ...    with open(SparkFiles.get("test.txt")) as testFile:
    ...        fileVal = int(testFile.readline())
    ...        return [x * 100 for x in iterator]
    >>> sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
    [100, 200, 300, 400]
    """
{code}

This is an example that writes 100 to a temp file, distributes it, and uses its 
value when multiplying values (to check whether the nodes can read the distributed 
file).

But look at these lines; the result will never be affected by the distributed file:
{code}
...fileVal = int(testFile.readline())
...return [x * 100 for x in iterator]
{code}

I'm sure this code was intended to be like this: 
{code}
...fileVal = int(testFile.readline())
...return [x * fileVal for x in iterator]
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2893) Should not swallow exception when cannot find custom Kryo registrator

2014-08-14 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2893.


   Resolution: Fixed
Fix Version/s: 1.0.3
   1.1.0

 Should not swallow exception when cannot find custom Kryo registrator
 -

 Key: SPARK-2893
 URL: https://issues.apache.org/jira/browse/SPARK-2893
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis
Assignee: Graham Dennis
Priority: Blocker
 Fix For: 1.1.0, 1.0.3


 If for any reason a custom Kryo registrator cannot be found, the task (and the 
 application) should fail, because otherwise problems could occur later during 
 data serialisation / deserialisation.  See SPARK-2878 for an example.
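
For reference, a minimal sketch of how a custom registrator is wired up; MyRegistrator 
and Point are made-up names, but the KryoRegistrator trait and the two configuration 
keys are the standard Spark API. If the registrator class is missing from the executor 
classpath, that is exactly the failure this issue wants surfaced instead of swallowed.

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class Point(x: Double, y: Double)

// Hypothetical registrator: if this class cannot be loaded on an executor,
// the task (and application) should fail loudly.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Point])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
{code}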



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3036) Add MapType containing null value support to Parquet.

2014-08-14 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-3036:


 Summary: Add MapType containing null value support to Parquet.
 Key: SPARK-3036
 URL: https://issues.apache.org/jira/browse/SPARK-3036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
Priority: Blocker


Current Parquet schema for {{MapType}} is as follows regardless of 
{{valueContainsNull}}:

{noformat}
message root {
  optional group a (MAP) {
repeated group map (MAP_KEY_VALUE) {
  required int32 key;
  required int32 value;
}
  }
}
{noformat}

and if the map contains a {{null}} value, it throws a runtime exception.

To handle {{MapType}} containing {{null}} value, the schema should be as 
follows if {{valueContainsNull}} is {{true}}:

{noformat}
message root {
  optional group a (MAP) {
repeated group map (MAP_KEY_VALUE) {
  required int32 key;
  optional int32 value;
}
  }
}
{noformat}

FYI:
Hive's Parquet writer *always* uses the latter schema, but its reader can read 
from both schemas.

NOTICE:
This change will break backward compatibility when the schema is read from 
Parquet metadata ({{org.apache.spark.sql.parquet.row.metadata}}).




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3038) delete history server logs when there are too many logs

2014-08-14 Thread wangfei (JIRA)
wangfei created SPARK-3038:
--

 Summary: delete history server logs when there are too many logs 
 Key: SPARK-3038
 URL: https://issues.apache.org/jira/browse/SPARK-3038
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: wangfei
 Fix For: 1.1.0


Enhance the history server to delete logs automatically:
1. use spark.history.deletelogs.enable to enable this function
2. when the number of application logs is greater than 
spark.history.maxsavedapplication, delete the old logs 
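
A sketch of how the proposed settings might appear in spark-defaults.conf; both 
property names come from this proposal and are not existing Spark configuration keys, 
and the threshold value is arbitrary:

{noformat}
# Proposed (not yet existing) history server settings
spark.history.deletelogs.enable       true
spark.history.maxsavedapplication     500
{noformat}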



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3038) delete history server logs when there are too many logs

2014-08-14 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-3038:
---

Description: 
enhance history server to delete logs automatically
1 use spark.history.deletelogs.enable to enable this function
2 when app logs num is greater than spark.history.maxsavedapplication, delete 
the older logs 

  was:
enhance history server to delete logs automatically
1 use spark.history.deletelogs.enable to enable this function
2 when app logs num is greater than spark.history.maxsavedapplication, delete 
the old logs 


 delete history server logs when there are too many logs 
 

 Key: SPARK-3038
 URL: https://issues.apache.org/jira/browse/SPARK-3038
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: wangfei
 Fix For: 1.1.0


 enhance history server to delete logs automatically
 1 use spark.history.deletelogs.enable to enable this function
 2 when app logs num is greater than spark.history.maxsavedapplication, delete 
 the older logs 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-08-14 Thread Bertrand Bossy (JIRA)
Bertrand Bossy created SPARK-3039:
-

 Summary: Spark assembly for new hadoop API (hadoop 2) contains 
avro-mapred for hadoop 1 API
 Key: SPARK-3039
 URL: https://issues.apache.org/jira/browse/SPARK-3039
 Project: Spark
  Issue Type: Bug
  Components: Build, Input/Output, Spark Core
Affects Versions: 1.0.0, 0.9.1
 Environment: hadoop2, hadoop-2.4.0, HDP-2.1
Reporter: Bertrand Bossy


The spark assembly contains the artifact org.apache.avro:avro-mapred as a 
dependency of org.spark-project.hive:hive-serde.

The avro-mapred package provides a hadoop FileInputFormat to read and write 
avro files. There are two versions of this package, distinguished by a 
classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. 
avro-mapred for the old Hadoop API uses no classifier.

E.g. when reading avro files using 
{code}
sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
{code}

The following error occurs:
{code}
java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at 
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}

This error usually is a hint that there was a mix up of the old and the new 
Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear 
before the version that is bundled with Spark, reading avro files works fine. 

Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.
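
A sketch of the work-around in sbt form, forcing the hadoop2 classifier of 
avro-mapred onto the classpath ahead of the unclassified artifact that comes in 
through hive-serde; the version number shown is a placeholder:

{code}
// build.sbt sketch -- the avro-mapred version here is a placeholder.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
{code}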







--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3026) Provide a good error message if JDBC server is used but Spark is not compiled with -Pthriftserver

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096961#comment-14096961
 ] 

Apache Spark commented on SPARK-3026:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/1944

 Provide a good error message if JDBC server is used but Spark is not compiled 
 with -Pthriftserver
 -

 Key: SPARK-3026
 URL: https://issues.apache.org/jira/browse/SPARK-3026
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Patrick Wendell
Assignee: Cheng Lian
Priority: Critical

 Instead of giving a ClassNotFoundException we should detect this case and 
 just tell the user to build with -Phiveserver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-08-14 Thread Bertrand Bossy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bertrand Bossy updated SPARK-3039:
--

Affects Version/s: 1.1.0

 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 
 1 API
 --

 Key: SPARK-3039
 URL: https://issues.apache.org/jira/browse/SPARK-3039
 Project: Spark
  Issue Type: Bug
  Components: Build, Input/Output, Spark Core
Affects Versions: 0.9.1, 1.0.0, 1.1.0
 Environment: hadoop2, hadoop-2.4.0, HDP-2.1
Reporter: Bertrand Bossy

 The spark assembly contains the artifact org.apache.avro:avro-mapred as a 
 dependency of org.spark-project.hive:hive-serde.
 The avro-mapred package provides a hadoop FileInputFormat to read and write 
 avro files. There are two versions of this package, distinguished by a 
 classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. 
 avro-mapred for the old Hadoop API uses no classifier.
 E.g. when reading avro files using 
 {code}
 sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
 {code}
 The following error occurs:
 {code}
 java.lang.IncompatibleClassChangeError: Found interface 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at 
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 This error usually is a hint that there was a mix up of the old and the new 
 Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to 
 appear before the version that is bundled with Spark, reading avro files 
 works fine. 
 Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3034) [Hive] java.sql.Date cannot be cast to java.sql.Timestamp

2014-08-14 Thread pengyanhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengyanhong updated SPARK-3034:
---

Summary: [Hive] java.sql.Date cannot be cast to java.sql.Timestamp  (was: 
java.sql.Date cannot be cast to java.sql.Timestamp)

 [Hive] java.sql.Date cannot be cast to java.sql.Timestamp
 -

 Key: SPARK-3034
 URL: https://issues.apache.org/jira/browse/SPARK-3034
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.2
Reporter: pengyanhong
Priority: Blocker

 Running a simple HiveQL via yarn-cluster produced the error below:
 {quote}
 Exception in thread Thread-2 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199)
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 
 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: 
 java.sql.Date cannot be cast to java.sql.Timestamp
 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33)
 
 org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439)
 
 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   

[jira] [Updated] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal

2014-08-14 Thread pengyanhong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengyanhong updated SPARK-3033:
---

Summary: [Hive] java.math.BigDecimal cannot be cast to 
org.apache.hadoop.hive.common.type.HiveDecimal  (was: java.math.BigDecimal 
cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal)

 [Hive] java.math.BigDecimal cannot be cast to 
 org.apache.hadoop.hive.common.type.HiveDecimal
 

 Key: SPARK-3033
 URL: https://issues.apache.org/jira/browse/SPARK-3033
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.2
Reporter: pengyanhong
Priority: Blocker

 Running a complex HiveQL via yarn-cluster produced the error below:
 {quote}
 14/08/14 15:05:24 WARN 
 org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to 
 java.lang.ClassCastException
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
 org.apache.hadoop.hive.common.type.HiveDecimal
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022)
   at 
 org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306)
   at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179)
   at 
 org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82)
   at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
   at 
 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62)
   at 
 org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309)
   at 
 org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {quote}
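
For context, the object inspector expects org.apache.hadoop.hive.common.type.HiveDecimal 
while Catalyst hands it a java.math.BigDecimal; a minimal sketch of the conversion that 
bridges the two (HiveDecimal.create is the Hive 0.13-style factory, older Hive releases 
expose a constructor instead, and the wrapper function itself is illustrative, not Spark 
code):

{code}
import java.math.BigDecimal
import org.apache.hadoop.hive.common.type.HiveDecimal

// Wrap Catalyst's BigDecimal values before handing them to Hive object
// inspectors that expect HiveDecimal.
def toHiveDecimal(value: Any): HiveDecimal = value match {
  case bd: BigDecimal  => HiveDecimal.create(bd)
  case hd: HiveDecimal => hd
  case other           => throw new IllegalArgumentException(s"Unexpected value: $other")
}
{code}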



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096973#comment-14096973
 ] 

Apache Spark commented on SPARK-3039:
-

User 'bbossy' has created a pull request for this issue:
https://github.com/apache/spark/pull/1945

 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 
 1 API
 --

 Key: SPARK-3039
 URL: https://issues.apache.org/jira/browse/SPARK-3039
 Project: Spark
  Issue Type: Bug
  Components: Build, Input/Output, Spark Core
Affects Versions: 0.9.1, 1.0.0, 1.1.0
 Environment: hadoop2, hadoop-2.4.0, HDP-2.1
Reporter: Bertrand Bossy

 The spark assembly contains the artifact org.apache.avro:avro-mapred as a 
 dependency of org.spark-project.hive:hive-serde.
 The avro-mapred package provides a hadoop FileInputFormat to read and write 
 avro files. There are two versions of this package, distinguished by a 
 classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. 
 avro-mapred for the old Hadoop API uses no classifier.
 E.g. when reading avro files using 
 {code}
 sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
 {code}
 The following error occurs:
 {code}
 java.lang.IncompatibleClassChangeError: Found interface 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at 
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 This error usually is a hint that there was a mix up of the old and the new 
 Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to 
 appear before the version that is bundled with Spark, reading avro files 
 works fine. 
 Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3040) pick up a more proper local ip address for Utils.findLocalIpAddress method

2014-08-14 Thread Ye Xianjin (JIRA)
Ye Xianjin created SPARK-3040:
-

 Summary: pick up a more proper local ip address for 
Utils.findLocalIpAddress method
 Key: SPARK-3040
 URL: https://issues.apache.org/jira/browse/SPARK-3040
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Mac os x, a bunch of network interfaces: eth0, wlan0, 
vnic0, vnic1, tun0, lo
Reporter: Ye Xianjin
Priority: Trivial


I noticed this inconvenience when I ran spark-shell with my virtual machines 
and a VPN service running.

There are a lot of network interfaces on my laptop (inactive devices omitted):
{quote}
lo0: inet 127.0.0.1
en1: inet 192.168.0.102
vnic0: inet 10.211.55.2 (virtual if for vm1)
vnic1: inet 10.37.129.3 (virtual if for vm2)
tun0: inet 172.16.100.191 -- 172.16.100.191 (tun device for VPN)
{quote}

In Spark core, Utils.findLocalIpAddress() uses 
NetworkInterface.getNetworkInterfaces to get all active network interfaces, but 
unfortunately this method returns the network interfaces in reverse order compared 
to the ifconfig output (both use the ioctl sys call). I dug into the OpenJDK 6 and 
7 source code and confirmed this behavior (it only happens on unix-like systems; 
Windows handles it and returns interfaces in index order). So the 
findLocalIpAddress method will pick the IP address associated with tun0 rather 
than en1.
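
A minimal sketch (not the Spark implementation) of the kind of heuristic this issue 
argues for: restore the OS index order and skip loopback and point-to-point (tun) 
devices before picking an address.

{code}
import java.net.{InetAddress, NetworkInterface}
import scala.collection.JavaConverters._

// Sketch only: prefer an address from a "real" interface over whatever the
// JDK happens to enumerate first (e.g. tun0). getIndex requires Java 7+.
def guessLocalIp(): InetAddress = {
  val ordered = NetworkInterface.getNetworkInterfaces.asScala.toSeq.sortBy(_.getIndex)
  val candidates = for {
    ni   <- ordered
    if ni.isUp && !ni.isLoopback && !ni.isPointToPoint   // skip lo0 and tun devices
    addr <- ni.getInetAddresses.asScala.toSeq
    if !addr.isLoopbackAddress
  } yield addr
  candidates.headOption.getOrElse(InetAddress.getLocalHost)
}
{code}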




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3040) pick up a more proper local ip address for Utils.findLocalIpAddress method

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097115#comment-14097115
 ] 

Apache Spark commented on SPARK-3040:
-

User 'advancedxy' has created a pull request for this issue:
https://github.com/apache/spark/pull/1946

 pick up a more proper local ip address for Utils.findLocalIpAddress method
 --

 Key: SPARK-3040
 URL: https://issues.apache.org/jira/browse/SPARK-3040
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
 Environment: Mac os x, a bunch of network interfaces: eth0, wlan0, 
 vnic0, vnic1, tun0, lo
Reporter: Ye Xianjin
Priority: Trivial
   Original Estimate: 1h
  Remaining Estimate: 1h

 I noticed this inconvenience when I ran spark-shell with my virtual machines 
 and a VPN service running.
 There are a lot of network interfaces on my laptop (inactive devices omitted):
 {quote}
 lo0: inet 127.0.0.1
 en1: inet 192.168.0.102
 vnic0: inet 10.211.55.2 (virtual if for vm1)
 vnic1: inet 10.37.129.3 (virtual if for vm2)
 tun0: inet 172.16.100.191 -- 172.16.100.191 (tun device for VPN)
 {quote}
 In Spark core, Utils.findLocalIpAddress() uses 
 NetworkInterface.getNetworkInterfaces to get all active network interfaces, 
 but unfortunately this method returns the network interfaces in reverse order 
 compared to the ifconfig output (both use the ioctl sys call). I dug into the 
 OpenJDK 6 and 7 source code and confirmed this behavior (it only happens on 
 unix-like systems; Windows handles it and returns interfaces in index order). 
 So the findLocalIpAddress method will pick the IP address associated with 
 tun0 rather than en1.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate

2014-08-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097193#comment-14097193
 ] 

Sandy Ryza commented on SPARK-3028:
---

+1 to what Patrick said.  I'll post a patch along those lines.

 sparkEventToJson should support SparkListenerExecutorMetricsUpdate
 --

 Key: SPARK-3028
 URL: https://issues.apache.org/jira/browse/SPARK-3028
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin
Priority: Blocker

 SparkListenerExecutorMetricsUpdate was added without updating 
 org.apache.spark.util.JsonProtocol.sparkEventToJson.
 This can crash the listener.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3009) ApplicationInfo doesn't get initialised after deserialisation during recovery

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097268#comment-14097268
 ] 

Apache Spark commented on SPARK-3009:
-

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/1947

 ApplicationInfo doesn't get initialised after deserialisation during recovery
 -

 Key: SPARK-3009
 URL: https://issues.apache.org/jira/browse/SPARK-3009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Jacek Lewandowski

 The {{readObject}} method has been removed from {{ApplicationInfo}}, so it no 
 longer initialises its transient fields properly after deserialisation. This 
 leads to an NPE being thrown during recovery of an application in 
 {{MetricSystem.registerSource}}. As [~andrewor14] said, he removed the 
 {{readObject}} method by accident. 
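
The usual Java-serialization idiom for this is a private readObject that calls 
defaultReadObject() and then re-initialises the transient state; a minimal sketch 
with made-up member names (these are not the actual ApplicationInfo fields):

{code}
import java.io.{IOException, ObjectInputStream}

class ExampleInfo(val appName: String) extends Serializable {
  @transient private var metricsSource: AnyRef = _

  private def init(): Unit = {
    metricsSource = new Object  // stand-in for the real transient state
  }
  init()

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    init()  // without this call, metricsSource is null after deserialisation
  }
}
{code}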



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2927) Add a conf to configure if we always read Binary columns stored in Parquet as String columns

2014-08-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2927.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 Add a conf to configure if we always read Binary columns stored in Parquet as 
 String columns
 

 Key: SPARK-2927
 URL: https://issues.apache.org/jira/browse/SPARK-2927
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.1.0


 Based on the Parquet spec (https://github.com/Parquet/parquet-format), strings 
 are stored as byte arrays (binary) with a UTF8 annotation. However, if the 
 data generator does not follow the spec, we will only read binary values back 
 instead of string values.
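
As a usage sketch, assuming the new flag is exposed through SQLContext under the configuration key "spark.sql.parquet.binaryAsString" (the exact key name should be checked against SQLConf in your Spark version):

{code}
// Sketch: force Parquet BINARY columns without a UTF8 annotation to be read as
// strings. The key name is an assumption; verify it in SQLConf.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BinaryAsStringExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("binary-as-string"))
    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val parquetData = sqlContext.parquetFile("/path/to/parquet")  // illustrative path
    parquetData.printSchema()   // string columns instead of binary
    sc.stop()
  }
}
{code}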



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3011) _temporary directory should be filtered out by sqlContext.parquetFile

2014-08-14 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3011.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

 _temporary directory should be filtered out by sqlContext.parquetFile
 -

 Key: SPARK-3011
 URL: https://issues.apache.org/jira/browse/SPARK-3011
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Joseph Su
 Fix For: 1.1.0


 Sometimes the _temporary directory is not removed after the file is committed on 
 S3. sqlContext.parquetFile will then fail because it is trying to read the metadata 
 in _temporary. sqlContext.parquetFile should just ignore that directory.
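
A minimal sketch of the filtering idea (not the actual patch), assuming the listing happens through the Hadoop FileSystem API:

{code}
// Sketch: when listing the files under a Parquet directory, skip the
// _temporary directory left behind by uncommitted writes on S3.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ParquetListing {
  def dataFiles(dir: String): Seq[Path] = {
    val path = new Path(dir)
    val fs: FileSystem = path.getFileSystem(new Configuration())
    fs.listStatus(path)
      .filterNot(_.getPath.getName == "_temporary")   // ignore uncommitted output
      .map(_.getPath)
      .toSeq
  }
}
{code}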



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect

2014-08-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3041:


 Summary: DecisionTree: isSampleValid indexing incorrect
 Key: SPARK-3041
 URL: https://issues.apache.org/jira/browse/SPARK-3041
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Joseph K. Bradley


In DecisionTree, isSampleValid treats unordered categorical features 
incorrectly: it treats the bins as if they were indexed by feature values, rather than 
by subsets of values/categories.
This bug is exhibited for unordered features (multi-class classification with 
categorical features of low arity).
Proposed fix: Index bins correctly for unordered categorical features.
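
An illustrative sketch of the indexing difference, with hypothetical helper names rather than the MLlib internals: for an unordered categorical feature of arity k, each candidate split is a subset of categories, which can be encoded as a bitmask, and the bin index must be interpreted as that subset rather than as a raw feature value.

{code}
// Illustrative sketch only (not the MLlib code).
object UnorderedBins {
  // One bin per candidate split; splits are proper, non-empty subsets of the
  // k category values, encoded as bitmasks.
  def binsForArity(arity: Int): Seq[Int] =
    (1 until (1 << (arity - 1))).toSeq

  // Correct check: does the subset encoded by `binIndexMask` contain the value?
  def sampleGoesLeft(featureValue: Int, binIndexMask: Int): Boolean =
    ((binIndexMask >> featureValue) & 1) == 1

  // Buggy variant (what the issue describes): treating the feature value itself
  // as the bin index, which only matches the subset encoding by accident.
  def buggyBinIndex(featureValue: Int): Int = featureValue
}
{code}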




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3043) DecisionTree aggregation is inefficient

2014-08-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3043:


 Summary: DecisionTree aggregation is inefficient
 Key: SPARK-3043
 URL: https://issues.apache.org/jira/browse/SPARK-3043
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley


2 major efficiency issues in computation and storage:

(1) DecisionTree aggregation involves reshaping data unnecessarily.

E.g., the internal methods extractNodeInfo() and getBinDataForNode() involve 
reshaping the data multiple times without real computation.

(2) DecisionTree splits and aggregate bins can include many unused bins/splits.

The same number of splits/bins are used for all features.  E.g., if there is a 
continuous feature which uses 100 bins, then there will also be 100 bins 
allocated for all binary features, even though only 2 are necessary.




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3042) DecisionTree filtering is very inefficient

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3042:
-

Assignee: Joseph K. Bradley

 DecisionTree filtering is very inefficient
 --

 Key: SPARK-3042
 URL: https://issues.apache.org/jira/browse/SPARK-3042
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 DecisionTree needs to match each example to a node at each iteration.  It 
 currently does this with a set of filters very inefficiently: For each 
 example, it examines each node at the current level and traces up to the root 
 to see if that example should be handled by that node.
 Proposed fix: Filter top-down using the partly built tree itself.
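
A minimal sketch of the proposed top-down routing, using a hypothetical node type rather than the MLlib one: each example walks from the root through the partly built tree, so the cost is O(depth) per example instead of re-checking every node's ancestor filters.

{code}
// Hedged sketch of the idea only; node type and fields are hypothetical.
case class SimpleNode(id: Int,
                      isLeaf: Boolean,
                      featureIndex: Int = -1,
                      threshold: Double = 0.0,
                      left: Option[SimpleNode] = None,
                      right: Option[SimpleNode] = None)

object TopDownAssignment {
  // Returns the id of the node that should handle the example. Assumes every
  // internal node in the partly built tree has both children set.
  def assign(root: SimpleNode, features: Array[Double]): Int = {
    var node = root
    while (!node.isLeaf) {
      node = if (features(node.featureIndex) <= node.threshold) node.left.get
             else node.right.get
    }
    node.id
  }
}
{code}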



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3043) DecisionTree aggregation is inefficient

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3043:
-

Assignee: Joseph K. Bradley

 DecisionTree aggregation is inefficient
 ---

 Key: SPARK-3043
 URL: https://issues.apache.org/jira/browse/SPARK-3043
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 2 major efficiency issues in computation and storage:
 (1) DecisionTree aggregation involves reshaping data unnecessarily.
 E.g., the internal methods extractNodeInfo() and getBinDataForNode() involve 
 reshaping the data multiple times without real computation.
 (2) DecisionTree splits and aggregate bins can include many unused 
 bins/splits.
 The same number of splits/bins are used for all features.  E.g., if there is 
 a continuous feature which uses 100 bins, then there will also be 100 bins 
 allocated for all binary features, even though only 2 are necessary.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3042) DecisionTree filtering is very inefficient

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3042:
-

Target Version/s: 1.1.0

 DecisionTree filtering is very inefficient
 --

 Key: SPARK-3042
 URL: https://issues.apache.org/jira/browse/SPARK-3042
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 DecisionTree needs to match each example to a node at each iteration.  It 
 currently does this with a set of filters very inefficiently: For each 
 example, it examines each node at the current level and traces up to the root 
 to see if that example should be handled by that node.
 Proposed fix: Filter top-down using the partly built tree itself.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3041:
-

Assignee: Joseph K. Bradley

 DecisionTree: isSampleValid indexing incorrect
 --

 Key: SPARK-3041
 URL: https://issues.apache.org/jira/browse/SPARK-3041
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 In DecisionTree, isSampleValid treats unordered categorical features 
 incorrectly: it treats the bins as if they were indexed by feature values, rather 
 than by subsets of values/categories.
 This bug is exhibited for unordered features (multi-class classification with 
 categorical features of low arity).
 Proposed fix: Index bins correctly for unordered categorical features.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3043) DecisionTree aggregation is inefficient

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3043:
-

 Target Version/s: 1.1.0
Affects Version/s: 1.1.0

 DecisionTree aggregation is inefficient
 ---

 Key: SPARK-3043
 URL: https://issues.apache.org/jira/browse/SPARK-3043
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 2 major efficiency issues in computation and storage:
 (1) DecisionTree aggregation involves reshaping data unnecessarily.
 E.g., the internal methods extractNodeInfo() and getBinDataForNode() involve 
 reshaping the data multiple times without real computation.
 (2) DecisionTree splits and aggregate bins can include many unused 
 bins/splits.
 The same number of splits/bins are used for all features.  E.g., if there is 
 a continuous feature which uses 100 bins, then there will also be 100 bins 
 allocated for all binary features, even though only 2 are necessary.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3041:
-

 Target Version/s: 1.1.0
Affects Version/s: 1.1.0

 DecisionTree: isSampleValid indexing incorrect
 --

 Key: SPARK-3041
 URL: https://issues.apache.org/jira/browse/SPARK-3041
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 In DecisionTree, isSampleValid treats unordered categorical features 
 incorrectly: it treats the bins as if they were indexed by feature values, rather 
 than by subsets of values/categories.
 This bug is exhibited for unordered features (multi-class classification with 
 categorical features of low arity).
 Proposed fix: Index bins correctly for unordered categorical features.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3044) Create RSS feed for Spark News

2014-08-14 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-3044:
---

 Summary: Create RSS feed for Spark News
 Key: SPARK-3044
 URL: https://issues.apache.org/jira/browse/SPARK-3044
 Project: Spark
  Issue Type: Documentation
Reporter: Nicholas Chammas
Priority: Minor


Project updates are often posted here: http://spark.apache.org/news/

Currently, there is no way to subscribe to a feed of these updates. It would be 
nice if there were a way for people to be notified of new posts there without 
having to check manually.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2979) Improve the convergence rate by minimizing the condition number in LOR with LBFGS

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2979:
-

Assignee: DB Tsai

 Improve the convergence rate by minimizing the condition number in LOR with 
 LBFGS
 -

 Key: SPARK-2979
 URL: https://issues.apache.org/jira/browse/SPARK-2979
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai
Assignee: DB Tsai

 Scaling to minimize the condition number:
 
 During the optimization process, the convergence (rate) depends on the 
 condition number of the training dataset. Scaling the variables often reduces 
 this condition number, thus improving the convergence rate dramatically. 
 Without reducing the condition number, some training datasets mixing 
 columns with different scales may not be able to converge.
  
 GLMNET and LIBSVM packages perform the scaling to reduce the condition 
 number, and return the weights in the original scale.
 See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
  
 Here, if useFeatureScaling is enabled, we will standardize the training 
 features by dividing the variance of each column (without subtracting the 
 mean), and train the model in the scaled space. Then we transform the 
 coefficients from the scaled space to the original scale as GLMNET and LIBSVM 
 do.

 Currently, it's only enabled in LogisticRegressionWithLBFGS
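
A hedged sketch of the scaling idea on plain arrays (the real code operates on MLlib vectors inside LogisticRegressionWithLBFGS); the per-column standard deviation is assumed here as the scaling factor:

{code}
// Sketch only: train in a scaled space, then map the weights back to the
// original scale, as GLMNET/LIBSVM do.
object FeatureScalingSketch {
  // Per-column scaling factors; the standard deviation is the usual choice.
  def columnStd(data: Array[Array[Double]]): Array[Double] = {
    val n = data.length.toDouble
    val dim = data.head.length
    Array.tabulate(dim) { j =>
      val mean = data.map(_(j)).sum / n
      math.sqrt(data.map(x => (x(j) - mean) * (x(j) - mean)).sum / n) max 1e-12
    }
  }

  def scale(data: Array[Array[Double]], std: Array[Double]): Array[Array[Double]] =
    data.map(row => row.zip(std).map { case (v, s) => v / s })

  // Weights learned in the scaled space are divided by the same factors to get
  // coefficients in the original feature space.
  def unscaleWeights(scaledWeights: Array[Double], std: Array[Double]): Array[Double] =
    scaledWeights.zip(std).map { case (w, s) => w / s }
}
{code}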



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2979) Improve the convergence rate by minimizing the condition number in LOR with LBFGS

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2979.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1897
[https://github.com/apache/spark/pull/1897]

 Improve the convergence rate by minimizing the condition number in LOR with 
 LBFGS
 -

 Key: SPARK-2979
 URL: https://issues.apache.org/jira/browse/SPARK-2979
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai
Assignee: DB Tsai
 Fix For: 1.1.0


 Scaling to minimize the condition number:
 
 During the optimization process, the convergence (rate) depends on the 
 condition number of the training dataset. Scaling the variables often reduces 
 this condition number, thus improving the convergence rate dramatically. 
 Without reducing the condition number, some training datasets mixing 
 columns with different scales may not be able to converge.
  
 GLMNET and LIBSVM packages perform the scaling to reduce the condition 
 number, and return the weights in the original scale.
 See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
  
 Here, if useFeatureScaling is enabled, we will standardize the training 
 features by dividing the variance of each column (without subtracting the 
 mean), and train the model in the scaled space. Then we transform the 
 coefficients from the scaled space to the original scale as GLMNET and LIBSVM 
 do.

 Currently, it's only enabled in LogisticRegressionWithLBFGS



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3046) Set executor's class loader as the default serializer class loader

2014-08-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3046:
--

 Summary: Set executor's class loader as the default serializer 
class loader
 Key: SPARK-3046
 URL: https://issues.apache.org/jira/browse/SPARK-3046
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin
Assignee: Reynold Xin


This is an attempt to fix the problem outlined in SPARK-2878.




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3045) Make Serializer interface Java friendly

2014-08-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3045:
--

 Summary: Make Serializer interface Java friendly
 Key: SPARK-3045
 URL: https://issues.apache.org/jira/browse/SPARK-3045
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin


The various Scheduler* traits have default implementations in them, which makes 
them hard to extend in Java.

We should turn those traits into abstract classes instead. 
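
A minimal illustration of the motivation, with hypothetical names: a Scala trait with a concrete member compiles to an interface plus a synthetic implementation class, which is awkward to extend from Java, whereas an abstract class can be subclassed from Java directly and its defaults are inherited.

{code}
// Illustration only; these are not Spark's actual types.
trait SerializerLikeTrait {
  def newInstance(): AnyRef
  def default: String = "trait default"            // concrete member in a trait:
                                                   // Java subclasses cannot inherit it cleanly
}

abstract class SerializerLikeBase {
  def newInstance(): AnyRef
  def default: String = "abstract class default"   // plain method, inherited from Java
}
{code}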



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3045) Make Serializer interface Java friendly

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097503#comment-14097503
 ] 

Apache Spark commented on SPARK-3045:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1948

 Make Serializer interface Java friendly
 ---

 Key: SPARK-3045
 URL: https://issues.apache.org/jira/browse/SPARK-3045
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin

 The various Scheduler* traits have default implementations in them, which 
 makes them hard to extend in Java.
 We should turn those traits into abstract classes instead. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3046) Set executor's class loader as the default serializer class loader

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097504#comment-14097504
 ] 

Apache Spark commented on SPARK-3046:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1948

 Set executor's class loader as the default serializer class loader
 --

 Key: SPARK-3046
 URL: https://issues.apache.org/jira/browse/SPARK-3046
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin
Assignee: Reynold Xin

 This is an attempt to fix the problem outlined in SPARK-2878.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3047) Use utf-8 for textFile() by default

2014-08-14 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3047:
-

 Summary: Use utf-8 for textFile() by default
 Key: SPARK-3047
 URL: https://issues.apache.org/jira/browse/SPARK-3047
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


In Python 2.x, most strings are byte arrays (str), which are more efficient than 
unicode (in both CPU and memory).

UTF-8 is the default encoding.

After disabling the decode from UTF-8 into unicode in UTF8Deserializer, the total 
time for a wordcount job is reduced by 32% (from 2m17s to 1m34s).

We could also add an argument unicode=False to textFile().



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3011) _temporary directory should be filtered out by sqlContext.parquetFile

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097537#comment-14097537
 ] 

Apache Spark commented on SPARK-3011:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/1949

 _temporary directory should be filtered out by sqlContext.parquetFile
 -

 Key: SPARK-3011
 URL: https://issues.apache.org/jira/browse/SPARK-3011
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Joseph Su
 Fix For: 1.1.0


 Sometimes the _temporary directory is not removed after the file is committed on 
 S3. sqlContext.parquetFile will then fail because it is trying to read the metadata 
 in _temporary. sqlContext.parquetFile should just ignore that directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097564#comment-14097564
 ] 

Apache Spark commented on SPARK-3041:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/1950

 DecisionTree: isSampleValid indexing incorrect
 --

 Key: SPARK-3041
 URL: https://issues.apache.org/jira/browse/SPARK-3041
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 In DecisionTree, isSampleValid treats unordered categorical features 
 incorrectly: it treats the bins as if they were indexed by feature values, rather 
 than by subsets of values/categories.
 This bug is exhibited for unordered features (multi-class classification with 
 categorical features of low arity).
 Proposed fix: Index bins correctly for unordered categorical features.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3022) FindBinsForLevel in decision tree should call findBin only once for each feature

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097563#comment-14097563
 ] 

Apache Spark commented on SPARK-3022:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/1950

 FindBinsForLevel in decision tree should call findBin only once for each 
 feature
 

 Key: SPARK-3022
 URL: https://issues.apache.org/jira/browse/SPARK-3022
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Qiping Li
   Original Estimate: 4h
  Remaining Estimate: 4h

 `findBinsForLevel` is applied to every `LabeledPoint` to find bins for all 
 nodes at a given level. Given a specific `LabeledPoint` and a specific 
 feature, the bin this labeled point goes into should always be the same. But in 
 the current implementation, `findBin` is called on a (LabeledPoint, feature) pair 
 for every node at a given level, which is a waste of computation. I propose 
 calling `findBin` only once and, if a `LabeledPoint` is valid on a node, reusing 
 this result.
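
A sketch of the reuse idea with hypothetical helper names (not the MLlib code): compute the bin of every (point, feature) pair exactly once, then let every node at the level look the result up.

{code}
// Sketch only: findBin is passed in as a plain function for illustration.
object FindBinOnce {
  type FindBin = (Array[Double], Int) => Int   // (features, featureIndex) => bin

  def binsForLevel(points: Seq[Array[Double]],
                   numFeatures: Int,
                   nodesAtLevel: Seq[Int],
                   findBin: FindBin): Map[(Int, Int), Array[Int]] = {
    // One pass: the bin of every feature of every point, computed exactly once.
    val cached: Seq[Array[Int]] =
      points.map(p => Array.tabulate(numFeatures)(f => findBin(p, f)))
    // Each node at the level reuses the cached bins for the points it owns,
    // instead of recomputing findBin per node.
    (for {
      node <- nodesAtLevel
      (bins, pointIdx) <- cached.zipWithIndex
    } yield (node, pointIdx) -> bins).toMap
  }
}
{code}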



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3047) add an option to use str in textFileRDD()

2014-08-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3047:
--

Summary: add an option to use str in textFileRDD()  (was: Use utf-8 for 
textFile() by default)

 add an option to use str in textFileRDD()
 -

 Key: SPARK-3047
 URL: https://issues.apache.org/jira/browse/SPARK-3047
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu

 In Python 2.x, most strings are byte arrays (str), which are more efficient 
 than unicode (in both CPU and memory).
 UTF-8 is the default encoding.
 After disabling the decode from UTF-8 into unicode in UTF8Deserializer, the 
 total time for a wordcount job is reduced by 32% (from 2m17s to 1m34s).
 We could also add an argument unicode=False to textFile().



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3047) add an option to use str in textFileRDD()

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097639#comment-14097639
 ] 

Apache Spark commented on SPARK-3047:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1951

 add an option to use str in textFileRDD()
 -

 Key: SPARK-3047
 URL: https://issues.apache.org/jira/browse/SPARK-3047
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu
Assignee: Davies Liu

 In Python 2.x, most strings are byte arrays (str), which are more efficient 
 than unicode (in both CPU and memory).
 UTF-8 is the default encoding.
 After disabling the decode from UTF-8 into unicode in UTF8Deserializer, the 
 total time for a wordcount job is reduced by 32% (from 2m17s to 1m34s).
 We could also add an argument unicode=False to textFile().



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2736) Create PySpark RDD from Apache Avro File

2014-08-14 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2736:
-

Priority: Major  (was: Minor)

 Create PySpark RDD from Apache Avro File
 

 Key: SPARK-2736
 URL: https://issues.apache.org/jira/browse/SPARK-2736
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Eric Garcia
Assignee: Kan Zhang
   Original Estimate: 4h
  Remaining Estimate: 4h

 There is a partially working example Avro Converter at this pull request: 
 https://github.com/apache/spark/pull/1536
 It does not fully implement all types in the Avro format and could be cleaned 
 up a little bit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2736) Create PySpark RDD from Apache Avro File

2014-08-14 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097701#comment-14097701
 ] 

Matei Zaharia commented on SPARK-2736:
--

I bumped this up to Major because the PR also contains an important bug fix.

 Create PySpark RDD from Apache Avro File
 

 Key: SPARK-2736
 URL: https://issues.apache.org/jira/browse/SPARK-2736
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Eric Garcia
Assignee: Kan Zhang
   Original Estimate: 4h
  Remaining Estimate: 4h

 There is a partially working example Avro Converter at this pull request: 
 https://github.com/apache/spark/pull/1536
 It does not fully implement all types in the Avro format and could be cleaned 
 up a little bit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3048) Make LabeledPointParser public

2014-08-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3048:


 Summary: Make LabeledPointParser public
 Key: SPARK-3048
 URL: https://issues.apache.org/jira/browse/SPARK-3048
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


`LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public 
may be useful for parsing labeled points from other data sources, e.g., a DStream. 
Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, which 
increases the complexity of the API. I would really like to make `parse` a method 
of `LabeledPoint$` instead of creating a new class `LabeledPointParser`, but it 
is hard to extend a case class with a static method.
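
A usage sketch of the motivation, with the parser passed in as a plain function because the final public entry point (LabeledPointParser.parse or a method on LabeledPoint) is still being decided:

{code}
// Sketch: with a public parser, labeled points can be read from any string
// source such as a DStream, not only via MLUtils.loadLabeledPoints.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

object StreamingLabeledPoints {
  // `parse` stands in for whatever public entry point this issue exposes.
  def fromTextStream(lines: DStream[String],
                     parse: String => LabeledPoint): DStream[LabeledPoint] =
    lines.map(parse)
}
{code}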



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3049) Make sure client doesn't block when server/connection has error(s)

2014-08-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3049:
--

 Summary: Make sure client doesn't block when server/connection has 
error(s)
 Key: SPARK-3049
 URL: https://issues.apache.org/jira/browse/SPARK-3049
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3048) Make LabeledPointParser public

2014-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3048:
-

Description: `LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. 
Making it public may be useful to parse labeled points from other data sources, 
e.g., DStream. Now we have a util function called 
`MLUtils.loadStreamingLabeledPoints`, which increases the complexity of the 
API. I really want to make `parse` as a method of `LabeledPoint$` instead of 
creating a new class `LabeledPointParser`, but it is hard to extend a case 
class with a static method in a binary compatible way.  (was: 
`LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public 
may be useful to parse labeled points from other data sources, e.g., DStream. 
Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, which 
increases the complexity of the API. I really want to make `parse` as a method 
of `LabeledPoint$` instead of creating a new class `LabeledPointParser`, but it 
is hard to extend a case class with a static method.)

 Make LabeledPointParser public
 --

 Key: SPARK-3048
 URL: https://issues.apache.org/jira/browse/SPARK-3048
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 `LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public 
 may be useful to parse labeled points from other data sources, e.g., DStream. 
 Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, 
 which increases the complexity of the API. I really want to make `parse` as a 
 method of `LabeledPoint$` instead of creating a new class 
 `LabeledPointParser`, but it is hard to extend a case class with a static 
 method in a binary compatible way.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3050) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster

2014-08-14 Thread Mingyu Kim (JIRA)
Mingyu Kim created SPARK-3050:
-

 Summary: Spark program running with 1.0.2 jar cannot run against a 
1.0.1 cluster
 Key: SPARK-3050
 URL: https://issues.apache.org/jira/browse/SPARK-3050
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
Reporter: Mingyu Kim
Priority: Critical


I ran the following code with Spark 1.0.2 jar against a cluster that runs Spark 
1.0.1 (i.e. localhost:7077 is running 1.0.1).

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TestTest {

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("spark://localhost:7077", "Test");
        List<Integer> list = new ArrayList<Integer>();
        list.add(1);
        list.add(2);
        list.add(3);
        JavaRDD<Integer> rdd = sc.parallelize(list);
        System.out.println(rdd.collect());
    }

}
{code}

This throws InvalidClassException.

{code}
Exception in thread main org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception 
failure in TID 6 on host 10.100.91.90: java.io.InvalidClassException: 
org.apache.spark.rdd.RDD; local class incompatible: stream classdesc 
serialVersionUID = -6766554341038829528, local class serialVersionUID = 
385418487991259089
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)

org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1835)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:722)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 

[jira] [Commented] (SPARK-3048) Make LabeledPointParser public

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097792#comment-14097792
 ] 

Apache Spark commented on SPARK-3048:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/1952

 Make LabeledPointParser public
 --

 Key: SPARK-3048
 URL: https://issues.apache.org/jira/browse/SPARK-3048
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 `LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public 
 may be useful to parse labeled points from other data sources, e.g., DStream. 
 Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, 
 which increases the complexity of the API. I really want to make `parse` as a 
 method of `LabeledPoint$` instead of creating a new class 
 `LabeledPointParser`, but it is hard to extend a case class with a static 
 method in a binary compatible way.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor

2014-08-14 Thread Jim Blomo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097808#comment-14097808
 ] 

Jim Blomo commented on SPARK-1284:
--

Hi, I'm having trouble compiling either master or branch-1.1, so I sent a request 
to the mailing list for help.  Are there any compiled snapshots?

 pyspark hangs after IOError on Executor
 ---

 Key: SPARK-1284
 URL: https://issues.apache.org/jira/browse/SPARK-1284
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Jim Blomo
Assignee: Davies Liu

 When running a reduceByKey over a cached RDD, Python fails with an exception, 
 but the failure is not detected by the task runner.  Spark and the pyspark 
 shell hang waiting for the task to finish.
 The error is:
 {code}
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File /home/hadoop/spark/python/pyspark/worker.py, line 77, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 182, in 
 dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 118, in 
 dump_stream
 self._write_with_length(obj, stream)
   File /home/hadoop/spark/python/pyspark/serializers.py, line 130, in 
 _write_with_length
 stream.write(serialized)
 IOError: [Errno 104] Connection reset by peer
 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 
 4257 bytes in 47 ms
 Traceback (most recent call last):
   File /home/hadoop/spark/python/pyspark/daemon.py, line 117, in 
 launch_worker
 worker(listen_sock)
   File /home/hadoop/spark/python/pyspark/daemon.py, line 107, in worker
 outfile.flush()
 IOError: [Errno 32] Broken pipe
 {code}
 I can reproduce the error by running take(10) on the cached RDD before 
 running reduceByKey (which looks at the whole input file).
 Affects Version 1.0.0-SNAPSHOT (4d88030486)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2014-08-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097826#comment-14097826
 ] 

Sandy Ryza commented on SPARK-2089:
---

H, it's true that my suggestion would require us to serialize and then 
immediately deserialize a possibly huge string.  How about Spark conf 
properties that just specify the input file and input format, and handles all 
the logic for converting this to location preferences on the other side.  This 
would also simplify things for the users (just need to set properties, not call 
any methods).
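
A sketch of what this proposal would look like from the application side; the property names are hypothetical, invented only to illustrate the idea:

{code}
// Sketch only: no such keys exist at the time of this discussion.
import org.apache.spark.SparkConf

object LocalityByConfExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("locality-by-conf")
      // Hypothetical properties describing the job's input:
      .set("spark.yarn.preferredInputPath", "hdfs:///data/events")
      .set("spark.yarn.preferredInputFormat",
           "org.apache.hadoop.mapred.TextInputFormat")
    // The Spark-YARN side would read these properties, query the InputFormat
    // for splits, and request executor containers on the hosts holding them,
    // before the SparkContext is ever constructed.
    println(conf.toDebugString)
  }
}
{code}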

 With YARN, preferredNodeLocalityData isn't honored 
 ---

 Key: SPARK-2089
 URL: https://issues.apache.org/jira/browse/SPARK-2089
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical

 When running in YARN cluster mode, apps can pass preferred locality data when 
 constructing a Spark context that will dictate where to request executor 
 containers.
 This is currently broken because of a race condition.  The Spark-YARN code 
 runs the user class and waits for it to start up a SparkContext.  During its 
 initialization, the SparkContext will create a YarnClusterScheduler, which 
 notifies a monitor in the Spark-YARN code that .  The Spark-Yarn code then 
 immediately fetches the preferredNodeLocationData from the SparkContext and 
 uses it to start requesting containers.
 But in the SparkContext constructor that takes the preferredNodeLocationData, 
 setting preferredNodeLocationData comes after the rest of the initialization, 
 so, if the Spark-YARN code comes around quickly enough after being notified, 
 the data that's fetched is the empty, unset version.  This occurred during all 
 of my runs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2014-08-14 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097826#comment-14097826
 ] 

Sandy Ryza edited comment on SPARK-2089 at 8/14/14 10:41 PM:
-

H, it's true that my suggestion would require us to serialize and then 
immediately deserialize a possibly huge string.  How about Spark conf 
properties that just specify the input file and input format? We would handle 
all the logic for converting this to location preferences on the other side.  
This would also simplify things for the users (just need to set properties, not 
call any methods).


was (Author: sandyr):
H, it's true that my suggestion would require us to serialize and then 
immediately deserialize a possibly huge string.  How about Spark conf 
properties that just specify the input file and input format, and handles all 
the logic for converting this to location preferences on the other side.  This 
would also simplify things for the users (just need to set properties, not call 
any methods).

 With YARN, preferredNodeLocalityData isn't honored 
 ---

 Key: SPARK-2089
 URL: https://issues.apache.org/jira/browse/SPARK-2089
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical

 When running in YARN cluster mode, apps can pass preferred locality data when 
 constructing a Spark context that will dictate where to request executor 
 containers.
 This is currently broken because of a race condition.  The Spark-YARN code 
 runs the user class and waits for it to start up a SparkContext.  During its 
 initialization, the SparkContext will create a YarnClusterScheduler, which 
 notifies a monitor in the Spark-YARN code that .  The Spark-Yarn code then 
 immediately fetches the preferredNodeLocationData from the SparkContext and 
 uses it to start requesting containers.
 But in the SparkContext constructor that takes the preferredNodeLocationData, 
 setting preferredNodeLocationData comes after the rest of the initialization, 
 so, if the Spark-YARN code comes around quickly enough after being notified, 
 the data that's fetched is the empty, unset version.  This occurred during all 
 of my runs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3051) Support looking-up named accumulators in a registry

2014-08-14 Thread Neil Ferguson (JIRA)
Neil Ferguson created SPARK-3051:


 Summary: Support looking-up named accumulators in a registry
 Key: SPARK-3051
 URL: https://issues.apache.org/jira/browse/SPARK-3051
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Neil Ferguson


This is a proposed enhancement to Spark based on the following mailing list 
discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html.

This proposal builds on SPARK-2380 (Support displaying accumulator values in 
the web UI) to allow named accumulables to be looked-up in a registry, as 
opposed to having to be passed to every method that needs to access them.

The use case was described well by [~shivaram], as follows:

Let's say you have two functions you use 
in a map call and want to measure how much time each of them takes. For 
example, if you have a code block like the one below and you want to 
measure how much time f1 takes as a fraction of the task. 

{noformat}
a.map { l => 
   val f = f1(l) 
   ... some work here ... 
} 
{noformat}

It would be really cool if we could do something like 

{noformat}
a.map { l => 
   val start = System.nanoTime 
   val f = f1(l) 
   TaskMetrics.get("f1-time").add(System.nanoTime - start) 
} 
{noformat}

SPARK-2380 provides a partial solution to this problem -- however the 
accumulables would still need to be passed to every function that needs them, 
which I think would be cumbersome in any application of reasonable complexity.

The proposal, as suggested by [~pwendell], is to have a registry of 
accumulables, that can be looked-up by name. 

Regarding the implementation details, I'd propose that we broadcast a 
serialized version of all named accumulables in the DAGScheduler (similar to 
what SPARK-2521 does for Tasks). These can then be deserialized in the 
Executor. 

Accumulables are already stored in thread-local variables in the Accumulators 
object, so exposing these in the registry should be simply a matter of wrapping 
this object, and keying the accumulables by name (they are currently keyed by 
ID).
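
A hedged sketch of the registry wrapper described above, with hypothetical names rather than Spark API: a thread-local map keyed by accumulable name that is populated on the executor after the broadcast accumulables are deserialised.

{code}
// Sketch only: the registry type, its name, and how it gets populated are all
// assumptions about the proposal, not existing Spark API.
import org.apache.spark.Accumulator

object AccumulableRegistry {
  private val local = new ThreadLocal[Map[String, Accumulator[Long]]] {
    override def initialValue(): Map[String, Accumulator[Long]] = Map.empty
  }

  // Called on the executor once the broadcast accumulables are deserialised.
  def register(byName: Map[String, Accumulator[Long]]): Unit = local.set(byName)

  // Looked up from user code running inside a task, e.g. get("f1-time").
  def get(name: String): Option[Accumulator[Long]] = local.get.get(name)
}
{code}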



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3051) Support looking-up named accumulators in a registry

2014-08-14 Thread Neil Ferguson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097893#comment-14097893
 ] 

Neil Ferguson commented on SPARK-3051:
--

I've done an initial prototype of this in the following commit: 
https://github.com/nfergu/spark/commit/fffaa5f9d0e6a0c1b7cfcd82867e733369296e07

It needs more tidy-up, documentation, and testing.

 Support looking-up named accumulators in a registry
 ---

 Key: SPARK-3051
 URL: https://issues.apache.org/jira/browse/SPARK-3051
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Neil Ferguson

 This is a proposed enhancement to Spark based on the following mailing list 
 discussion: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html.
 This proposal builds on SPARK-2380 (Support displaying accumulator values in 
 the web UI) to allow named accumulables to be looked-up in a registry, as 
 opposed to having to be passed to every method that needs to access them.
 The use case was described well by [~shivaram], as follows:
 Let's say you have two functions you use 
 in a map call and want to measure how much time each of them takes. For 
 example, if you have a code block like the one below and you want to 
 measure how much time f1 takes as a fraction of the task. 
 {noformat}
 a.map { l => 
    val f = f1(l) 
    ... some work here ... 
 } 
 {noformat}
 It would be really cool if we could do something like 
 {noformat}
 a.map { l => 
    val start = System.nanoTime 
    val f = f1(l) 
    TaskMetrics.get("f1-time").add(System.nanoTime - start) 
 } 
 {noformat}
 SPARK-2380 provides a partial solution to this problem -- however the 
 accumulables would still need to be passed to every function that needs them, 
 which I think would be cumbersome in any application of reasonable complexity.
 The proposal, as suggested by [~pwendell], is to have a registry of 
 accumulables, that can be looked-up by name. 
 Regarding the implementation details, I'd propose that we broadcast a 
 serialized version of all named accumulables in the DAGScheduler (similar to 
 what SPARK-2521 does for Tasks). These can then be deserialized in the 
 Executor. 
 Accumulables are already stored in thread-local variables in the Accumulators 
 object, so exposing these in the registry should be simply a matter of 
 wrapping this object, and keying the accumulables by name (they are currently 
 keyed by ID).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3052) Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop

2014-08-14 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3052:
-

 Summary: Misleading and spurious FileSystem closed errors whenever 
a job fails while reading from Hadoop
 Key: SPARK-3052
 URL: https://issues.apache.org/jira/browse/SPARK-3052
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3052) Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097948#comment-14097948
 ] 

Apache Spark commented on SPARK-3052:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/1956

 Misleading and spurious FileSystem closed errors whenever a job fails while 
 reading from Hadoop
 ---

 Key: SPARK-3052
 URL: https://issues.apache.org/jira/browse/SPARK-3052
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3050) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster

2014-08-14 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097947#comment-14097947
 ] 

Patrick Wendell commented on SPARK-3050:


Hi [~mkim] - when you launch jobs in client mode (i.e. the driver is running 
where you launch the job), you will need to use an updated version of the 
spark-submit launch script. You don't need to recompile your job, though; you 
just need to update spark-submit in tandem with your cluster upgrade. Does that 
work?

 Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster
 ---

 Key: SPARK-3050
 URL: https://issues.apache.org/jira/browse/SPARK-3050
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
Reporter: Mingyu Kim
Priority: Critical

 I ran the following code with Spark 1.0.2 jar against a cluster that runs 
 Spark 1.0.1 (i.e. localhost:7077 is running 1.0.1).
 {code}
 import java.util.ArrayList;
 import java.util.List;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 public class TestTest {
     public static void main(String[] args) {
         JavaSparkContext sc = new JavaSparkContext("spark://localhost:7077", "Test");
         List<Integer> list = new ArrayList<Integer>();
         list.add(1);
         list.add(2);
         list.add(3);
         JavaRDD<Integer> rdd = sc.parallelize(list);
         System.out.println(rdd.collect());
     }
 }
 {code}
 This throws InvalidClassException.
 {code}
 Exception in thread main org.apache.spark.SparkException: Job aborted due 
 to stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host 10.100.91.90: java.io.InvalidClassException: 
 org.apache.spark.rdd.RDD; local class incompatible: stream classdesc 
 serialVersionUID = -6766554341038829528, local class serialVersionUID = 
 385418487991259089
 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604)
 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1835)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:722)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at 

[jira] [Resolved] (SPARK-3050) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster

2014-08-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3050.


Resolution: Not a Problem

I think the issue here is just needing to use the newer version of the launch 
script. Feel free to re-open this though if I am misunderstanding.

 Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster
 ---

 Key: SPARK-3050
 URL: https://issues.apache.org/jira/browse/SPARK-3050
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
Reporter: Mingyu Kim
Priority: Critical

 I ran the following code with Spark 1.0.2 jar against a cluster that runs 
 Spark 1.0.1 (i.e. localhost:7077 is running 1.0.1).
 {code}
 import java.util.ArrayList;
 import java.util.List;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 public class TestTest {
     public static void main(String[] args) {
         JavaSparkContext sc = new JavaSparkContext("spark://localhost:7077", "Test");
         List<Integer> list = new ArrayList<Integer>();
         list.add(1);
         list.add(2);
         list.add(3);
         JavaRDD<Integer> rdd = sc.parallelize(list);
         System.out.println(rdd.collect());
     }
 }
 {code}
 This throws InvalidClassException.
 {code}
 Exception in thread main org.apache.spark.SparkException: Job aborted due 
 to stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host 10.100.91.90: java.io.InvalidClassException: 
 org.apache.spark.rdd.RDD; local class incompatible: stream classdesc 
 serialVersionUID = -6766554341038829528, local class serialVersionUID = 
 385418487991259089
 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604)
 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1835)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:722)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 

[jira] [Created] (SPARK-3053) Reconcile spark.files.userClassPathFirst with spark.yarn.user.classpath.first

2014-08-14 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3053:
-

 Summary: Reconcile spark.files.userClassPathFirst with 
spark.yarn.user.classpath.first
 Key: SPARK-3053
 URL: https://issues.apache.org/jira/browse/SPARK-3053
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.2
Reporter: Sandy Ryza
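
Since the summary names two overlapping settings but carries no description, here is a minimal sketch of where an application would set both of them today. The class name, master, and values are purely illustrative.

{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class UserClassPathFirstDemo {
    public static void main(String[] args) {
        // The two overlapping keys named in the summary; both are boolean flags.
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("UserClassPathFirstDemo")
                .set("spark.files.userClassPathFirst", "true")    // generic setting
                .set("spark.yarn.user.classpath.first", "true");  // YARN-specific setting
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}
{code}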






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3050) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster

2014-08-14 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3050:
---

Priority: Major  (was: Critical)

 Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster
 ---

 Key: SPARK-3050
 URL: https://issues.apache.org/jira/browse/SPARK-3050
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
Reporter: Mingyu Kim

 I ran the following code with Spark 1.0.2 jar against a cluster that runs 
 Spark 1.0.1 (i.e. localhost:7077 is running 1.0.1).
 {code}
 import java.util.ArrayList;
 import java.util.List;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 public class TestTest {
     public static void main(String[] args) {
         JavaSparkContext sc = new JavaSparkContext("spark://localhost:7077", "Test");
         List<Integer> list = new ArrayList<Integer>();
         list.add(1);
         list.add(2);
         list.add(3);
         JavaRDD<Integer> rdd = sc.parallelize(list);
         System.out.println(rdd.collect());
     }
 }
 {code}
 This throws InvalidClassException.
 {code}
 Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
 to stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception 
 failure in TID 6 on host 10.100.91.90: java.io.InvalidClassException: 
 org.apache.spark.rdd.RDD; local class incompatible: stream classdesc 
 serialVersionUID = -6766554341038829528, local class serialVersionUID = 
 385418487991259089
 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604)
 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620)
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1835)
 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794)
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:722)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

[jira] [Created] (SPARK-3054) Add tests for SparkSink

2014-08-14 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-3054:
---

 Summary: Add tests for SparkSink
 Key: SPARK-3054
 URL: https://issues.apache.org/jira/browse/SPARK-3054
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3054) Add tests for SparkSink

2014-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097973#comment-14097973
 ] 

Apache Spark commented on SPARK-3054:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/1958

 Add tests for SparkSink
 ---

 Key: SPARK-3054
 URL: https://issues.apache.org/jira/browse/SPARK-3054
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2213) Sort Merge Join

2014-08-14 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097995#comment-14097995
 ] 

Cheng Hao commented on SPARK-2213:
--

Sort Merge Join depends on the reduce-side sort merge, which was done in SPARK-2926.
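
For readers who have not seen the algorithm, here is a compact, Spark-free sketch of what the reduce-side sort buys: once both inputs arrive sorted by key, the join becomes a single linear merge pass with no hash table. The code below is illustrative only and is not the SPARK-2213 implementation.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortMergeJoinSketch {

    // left and right hold (key, value) pairs already sorted by key, e.g. the
    // output of a reduce-side sort. Emits one "key,leftValue,rightValue" row
    // per matching pair.
    static List<String> join(List<int[]> left, List<int[]> right) {
        List<String> out = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int lk = left.get(i)[0], rk = right.get(j)[0];
            if (lk < rk) {
                i++;               // left key smaller: no match, advance left
            } else if (lk > rk) {
                j++;               // right key smaller: no match, advance right
            } else {
                // Equal keys: find the run of this key on both sides, emit the
                // cross product of the two runs, then skip past both of them.
                int iEnd = i, jEnd = j;
                while (iEnd < left.size() && left.get(iEnd)[0] == lk) iEnd++;
                while (jEnd < right.size() && right.get(jEnd)[0] == lk) jEnd++;
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(lk + "," + left.get(a)[1] + "," + right.get(b)[1]);
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> left = Arrays.asList(new int[]{1, 10}, new int[]{2, 20}, new int[]{2, 21});
        List<int[]> right = Arrays.asList(new int[]{2, 200}, new int[]{3, 300});
        System.out.println(join(left, right)); // prints [2,20,200, 2,21,200]
    }
}
{code}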

 Sort Merge Join
 ---

 Key: SPARK-2213
 URL: https://issues.apache.org/jira/browse/SPARK-2213
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3055) Stack trace logged in driver on job failure is usually uninformative

2014-08-14 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3055:
-

 Summary: Stack trace logged in driver on job failure is usually 
uninformative
 Key: SPARK-3055
 URL: https://issues.apache.org/jira/browse/SPARK-3055
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Priority: Minor


{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:5 
failed 4 times, most recent failure: TID 24 on host hddn04.lsrc.duke.edu failed 
for unknown reason
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

At a cursory glance, I would expect the stack trace to point to where the task 
error occurred. In fact, it shows where the driver became aware of the error and 
decided to fail the job. This has been a common point of confusion among our 
customers.
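
A small way to see this first-hand, assuming a local master; the class name, app name, and exception message below are made up for illustration.

{code}
import java.util.Arrays;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class DriverStackTraceDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "DriverStackTraceDemo");
        try {
            sc.parallelize(Arrays.asList(1, 2, 3))
              .map(new Function<Integer, Integer>() {
                  public Integer call(Integer x) {
                      // The real failure happens here, inside the task.
                      throw new IllegalStateException("boom in task");
                  }
              })
              .collect();
        } catch (Exception e) {
            // The SparkException caught on the driver carries a "Driver stacktrace:"
            // like the one quoted above, rooted in the DAGScheduler rather than in
            // the map function; the task-side cause has to be read from the failure
            // message or the executor logs.
            e.printStackTrace();
        } finally {
            sc.stop();
        }
    }
}
{code}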



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3056) Sort-based Aggregation

2014-08-14 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3056:


 Summary: Sort-based Aggregation
 Key: SPARK-3056
 URL: https://issues.apache.org/jira/browse/SPARK-3056
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao


Currently, SparkSQL only supports hash-based aggregation, which may cause OOM 
if there are too many identical keys in the input tuples.
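
For contrast, here is a tiny, Spark-free sketch of the sort-based alternative: once the input is sorted by key, each group is folded into a single running buffer and emitted as soon as the key changes, so memory stays bounded regardless of the input. The class name and the sum aggregate are illustrative only, not the SparkSQL operator.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortBasedAggregationSketch {

    // Sums values per key over (key, value) pairs that are already sorted by key.
    // Only the current group's key and running sum are held in memory.
    static List<String> sumByKey(List<int[]> sortedPairs) {
        List<String> out = new ArrayList<String>();
        Integer currentKey = null;
        long sum = 0;
        for (int[] pair : sortedPairs) {
            if (currentKey != null && pair[0] != currentKey) {
                out.add(currentKey + "=" + sum);   // key changed: emit finished group
                sum = 0;
            }
            currentKey = pair[0];
            sum += pair[1];
        }
        if (currentKey != null) {
            out.add(currentKey + "=" + sum);       // emit the last group
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> input = Arrays.asList(
                new int[]{1, 5}, new int[]{1, 7}, new int[]{2, 1}, new int[]{3, 4});
        System.out.println(sumByKey(input));       // prints [1=12, 2=1, 3=4]
    }
}
{code}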



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


