[jira] [Created] (SPARK-3027) Tighten the visibility of fields in TaskContext and provide Java friendly callback API
Reynold Xin created SPARK-3027: -- Summary: Tighten the visibility of fields in TaskContext and provide Java friendly callback API Key: SPARK-3027 URL: https://issues.apache.org/jira/browse/SPARK-3027 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TaskContext suffers from two problems: 1. The callback API accepts a Scala closure and is not Java friendly. 2. interrupted, completed are public vars, which means users can randomly change the control flow of tasks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
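A minimal sketch of the direction described in SPARK-3027 (not the actual patch): expose the completion callback through a single-method listener trait that Java code can implement directly, and hide the mutable flags behind read-only accessors. The trait and method names below are assumptions.
{code:title=TaskContextSketch.scala}
import scala.collection.mutable.ArrayBuffer

// Single-method listener trait: Java callers can implement it directly
// instead of having to construct a Scala closure.
trait TaskCompletionListener {
  def onTaskCompletion(context: TaskContext): Unit
}

class TaskContext(val stageId: Int, val partitionId: Int, val attemptId: Long) {
  // Formerly public vars; keeping the setters private[spark] means user code
  // can no longer flip the task's control flow.
  @volatile private[spark] var interrupted: Boolean = false
  @volatile private[spark] var completed: Boolean = false

  def isInterrupted: Boolean = interrupted
  def isCompleted: Boolean = completed

  private val onCompleteCallbacks = new ArrayBuffer[TaskCompletionListener]

  // Java-friendly callback registration.
  def addTaskCompletionListener(listener: TaskCompletionListener): TaskContext = {
    onCompleteCallbacks += listener
    this
  }

  private[spark] def markTaskCompleted(): Unit = {
    completed = true
    // Run callbacks in reverse order of registration.
    onCompleteCallbacks.reverse.foreach(_.onTaskCompletion(this))
  }
}
{code}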
[jira] [Commented] (SPARK-3027) Tighten visibility and provide Java friendly callback API in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096632#comment-14096632 ] Apache Spark commented on SPARK-3027: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1938 Tighten visibility and provide Java friendly callback API in TaskContext Key: SPARK-3027 URL: https://issues.apache.org/jira/browse/SPARK-3027 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TaskContext suffers from two problems: 1. The callback API accepts a Scala closure and is not Java friendly. 2. interrupted, completed are public vars, which means users can randomly change the control flow of tasks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3027) Tighten visibility and provide Java friendly callback API in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-3027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3027: --- Summary: Tighten visibility and provide Java friendly callback API in TaskContext (was: Tighten the visibility of fields in TaskContext and provide Java friendly callback API) Tighten visibility and provide Java friendly callback API in TaskContext Key: SPARK-3027 URL: https://issues.apache.org/jira/browse/SPARK-3027 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TaskContext suffers from two problems: 1. The callback API accepts a Scala closure and is not Java friendly. 2. interrupted, completed are public vars, which means users can randomly change the control flow of tasks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096636#comment-14096636 ] Kostiantyn Kudriavtsev commented on SPARK-2356: --- Guoqiang, Spark does not work exclusively with Hadoop; it can run entirely outside a Hadoop cluster/environment. So it's expected that these two variables might not be set. Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev Priority: Critical I'm trying to run some transformations on Spark. It works fine on a cluster (YARN, Linux machines). However, when I try to run it on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop; I'm reading files from the local filesystem): {code} 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) at org.apache.hadoop.util.Shell.clinit(Shell.java:326) at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) at org.apache.hadoop.security.Groups.init(Groups.java:77) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.init(SparkContext.scala:228) at org.apache.spark.SparkContext.init(SparkContext.scala:97) {code} This happens because the Hadoop config is initialized every time a SparkContext is created, regardless of whether Hadoop is required. I propose adding a special flag to indicate whether the Hadoop config is required (or starting this configuration manually) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
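A common workaround for running such unit tests on Windows without a Hadoop installation, until a flag like the one proposed above exists, is to point the Hadoop Shell class at a directory that contains bin\winutils.exe before the SparkContext is created. This is only a sketch: the C:\hadoop path is a placeholder and the rest is a generic local-mode setup.
{code:title=LocalWindowsTest.scala}
import org.apache.spark.{SparkConf, SparkContext}

object LocalWindowsTest {
  def main(args: Array[String]): Unit = {
    // Hadoop's Shell class resolves winutils.exe via HADOOP_HOME / hadoop.home.dir,
    // so pointing it at a folder that contains bin\winutils.exe avoids the
    // IOException above. The path is a placeholder.
    System.setProperty("hadoop.home.dir", "C:\\hadoop")

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
    try {
      // Reading from the local filesystem; no Hadoop cluster is involved.
      val lines = sc.textFile("C:\\data\\input.txt")
      println(lines.count())
    } finally {
      sc.stop()
    }
  }
}
{code}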
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096635#comment-14096635 ] Tarek Nabil commented on SPARK-2356: Yes, but the whole point is that you should not need Hadoop at all. Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev Priority: Critical I'm trying to run some transformations on Spark. It works fine on a cluster (YARN, Linux machines). However, when I try to run it on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop; I'm reading files from the local filesystem): {code} 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) at org.apache.hadoop.util.Shell.clinit(Shell.java:326) at org.apache.hadoop.util.StringUtils.clinit(StringUtils.java:76) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) at org.apache.hadoop.security.Groups.init(Groups.java:77) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.init(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.init(SparkContext.scala:228) at org.apache.spark.SparkContext.init(SparkContext.scala:97) {code} This happens because the Hadoop config is initialized every time a SparkContext is created, regardless of whether Hadoop is required. I propose adding a special flag to indicate whether the Hadoop config is required (or starting this configuration manually) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate
Reynold Xin created SPARK-3028: -- Summary: sparkEventToJson should support SparkListenerExecutorMetricsUpdate Key: SPARK-3028 URL: https://issues.apache.org/jira/browse/SPARK-3028 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Priority: Blocker SparkListenerExecutorMetricsUpdate was added without updating org.apache.spark.util.JsonProtocol.sparkEventToJson. This can crash the listener. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate
[ https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096645#comment-14096645 ] Reynold Xin commented on SPARK-3028: [~sandyr] [~sandyryza] Would you be able to do this quickly (tmr maybe)? If not, somebody else should fix this ... cc [~andrewor14] sparkEventToJson should support SparkListenerExecutorMetricsUpdate -- Key: SPARK-3028 URL: https://issues.apache.org/jira/browse/SPARK-3028 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Priority: Blocker SparkListenerExecutorMetricsUpdate was added without updating org.apache.spark.util.JsonProtocol.sparkEventToJson. This can crash the listener. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2456) Scheduler refactoring
[ https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2456: --- Component/s: Spark Core Scheduler refactoring - Key: SPARK-2456 URL: https://issues.apache.org/jira/browse/SPARK-2456 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This is an umbrella ticket to track scheduler refactoring. We want to clearly define semantics and responsibilities of each component, and define explicit public interfaces for them so it is easier to understand and to contribute (also less buggy). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3029) Disable local execution of Spark jobs by default
Aaron Davidson created SPARK-3029: - Summary: Disable local execution of Spark jobs by default Key: SPARK-3029 URL: https://issues.apache.org/jira/browse/SPARK-3029 Project: Spark Issue Type: Improvement Reporter: Aaron Davidson Assignee: Aaron Davidson Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead. Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
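If the change lands behind a configuration flag, jobs that still want the old driver-side execution of take() could opt back in explicitly. The key name below (spark.localExecution.enabled) is an assumption, not something confirmed by this ticket.
{code}
import org.apache.spark.SparkConf

// Sketch only: opting back in to local execution, assuming a boolean flag
// named spark.localExecution.enabled guards the behavior.
val conf = new SparkConf()
  .setAppName("opt-in-local-execution")
  .set("spark.localExecution.enabled", "true")
{code}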
[jira] [Commented] (SPARK-3019) Pluggable block transfer (data plane communication) interface
[ https://issues.apache.org/jira/browse/SPARK-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096654#comment-14096654 ] Hari Shreedharan commented on SPARK-3019: - Why specifically MapR FS? You could use HDFS for this as well right? Pluggable block transfer (data plane communication) interface - Key: SPARK-3019 URL: https://issues.apache.org/jira/browse/SPARK-3019 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Attachments: PluggableBlockTransferServiceProposalforSpark - draft 1.pdf The attached design doc proposes a standard interface for block transferring, which will make future engineering of this functionality easier, allowing the Spark community to provide alternative implementations. Block transferring is a critical function in Spark. All of the following depend on it: * shuffle * torrent broadcast * block replication in BlockManager * remote block reads for tasks scheduled without locality -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3030) reuse python worker
Davies Liu created SPARK-3030: - Summary: reuse python worker Key: SPARK-3030 URL: https://issues.apache.org/jira/browse/SPARK-3030 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Currently, we fork a Python worker for each task; it would be better if we could reuse the worker for later tasks. This would be very useful for large datasets with big broadcast variables, since the broadcast would not need to be sent to the worker again and again. It would also reduce the overhead of launching a task. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3019) Pluggable block transfer (data plane communication) interface
[ https://issues.apache.org/jira/browse/SPARK-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096658#comment-14096658 ] Reynold Xin commented on SPARK-3019: Possibly, although I think MapR FS is more optimized for this kind of workload, as they have done that previously for MR shuffle. Don't quote me on this one. Pluggable block transfer (data plane communication) interface - Key: SPARK-3019 URL: https://issues.apache.org/jira/browse/SPARK-3019 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Attachments: PluggableBlockTransferServiceProposalforSpark - draft 1.pdf The attached design doc proposes a standard interface for block transferring, which will make future engineering of this functionality easier, allowing the Spark community to provide alternative implementations. Block transferring is a critical function in Spark. All of the following depend on it: * shuffle * torrent broadcast * block replication in BlockManager * remote block reads for tasks scheduled without locality -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1170) Add histogram() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandan Kumar resolved SPARK-1170. -- Resolution: Duplicate Davies is working on this. Add histogram() to PySpark -- Key: SPARK-1170 URL: https://issues.apache.org/jira/browse/SPARK-1170 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Matei Zaharia Assignee: Josh Rosen Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3031) Create JsonSerializable and move JSON serialization from JsonProtocol into each class
[ https://issues.apache.org/jira/browse/SPARK-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096664#comment-14096664 ] Reynold Xin commented on SPARK-3031: cc [~andrewor14] [~adav] Create JsonSerializable and move JSON serialization from JsonProtocol into each class - Key: SPARK-3031 URL: https://issues.apache.org/jira/browse/SPARK-3031 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin It is really, really weird that we have a single file JsonProtocol that handles the JSON serialization/deserialization for a bunch of classes. This is very error prone because it is easy to add a new field to a class, and forget to update the JSON part of it. Or worse, add a new event class and forget to add one, as evidenced by https://issues.apache.org/jira/browse/SPARK-3028 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3029) Disable local execution of Spark jobs by default
[ https://issues.apache.org/jira/browse/SPARK-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3029: --- Priority: Blocker (was: Major) Target Version/s: 1.1.0 Disable local execution of Spark jobs by default Key: SPARK-3029 URL: https://issues.apache.org/jira/browse/SPARK-3029 Project: Spark Issue Type: Improvement Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead. Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3031) Create JsonSerializable and move JSON serialization from JsonProtocol into each class
[ https://issues.apache.org/jira/browse/SPARK-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3031: --- Component/s: Spark Core Create JsonSerializable and move JSON serialization from JsonProtocol into each class - Key: SPARK-3031 URL: https://issues.apache.org/jira/browse/SPARK-3031 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin It is really, really weird that we have a single file JsonProtocol that handles the JSON serialization/deserialization for a bunch of classes. This is very error prone because it is easy to add a new field to a class, and forget to update the JSON part of it. Or worse, add a new event class and forget to add one, as evidenced by https://issues.apache.org/jira/browse/SPARK-3028 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3031) Create JsonSerializable and move JSON serialization from JsonProtocol into each class
Reynold Xin created SPARK-3031: -- Summary: Create JsonSerializable and move JSON serialization from JsonProtocol into each class Key: SPARK-3031 URL: https://issues.apache.org/jira/browse/SPARK-3031 Project: Spark Issue Type: Bug Reporter: Reynold Xin It is really, really weird that we have a single file JsonProtocol that handles the JSON serialization/deserialization for a bunch of classes. This is very error prone because it is easy to add a new field to a class, and forget to update the JSON part of it. Or worse, add a new event class and forget to add one, as evidenced by https://issues.apache.org/jira/browse/SPARK-3028 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
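A sketch of what the proposed trait could look like, with each event class owning its JSON form so that adding a field (or a whole new event) forces the author to revisit serialization. Only the trait name comes from the ticket title; the method name and the example class are assumptions, written against json4s as used by JsonProtocol.
{code:title=JsonSerializableSketch.scala}
import org.json4s.JsonAST.JValue
import org.json4s.JsonDSL._

// Each class knows how to serialize itself, instead of relying on a single
// JsonProtocol match block that is easy to forget to update.
trait JsonSerializable {
  def toJson: JValue
}

// Hypothetical event used purely for illustration.
case class ExampleStageSubmitted(stageId: Int, stageName: String) extends JsonSerializable {
  override def toJson: JValue =
    ("Event" -> "ExampleStageSubmitted") ~
    ("Stage ID" -> stageId) ~
    ("Stage Name" -> stageName)
}
{code}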
[jira] [Resolved] (SPARK-2995) Allow to set storage level for intermediate RDDs in ALS
[ https://issues.apache.org/jira/browse/SPARK-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2995. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1913 [https://github.com/apache/spark/pull/1913] Allow to set storage level for intermediate RDDs in ALS --- Key: SPARK-2995 URL: https://issues.apache.org/jira/browse/SPARK-2995 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 As mentioned in [SPARK-2465], using MEMORY_AND_DISK_SER together with spark.rdd.compress=true can help reduce the space requirement by a lot, at the cost of speed. It might be useful to add this option so people can run ALS on much bigger datasets. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
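A usage sketch of the new option, assuming it is exposed as a setter on the ALS builder (the method name setIntermediateRDDStorageLevel is an assumption):
{code}
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.storage.StorageLevel

// Serialize intermediate RDDs and spill to disk to shrink the memory footprint,
// at the cost of speed; combine with spark.rdd.compress=true (see SPARK-2465).
val als = new ALS()
  .setRank(20)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER)
{code}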
[jira] [Updated] (SPARK-2893) Should not swallow exception when cannot find custom Kryo registrator
[ https://issues.apache.org/jira/browse/SPARK-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2893: --- Priority: Blocker (was: Major) Target Version/s: 1.1.0 Should not swallow exception when cannot find custom Kryo registrator - Key: SPARK-2893 URL: https://issues.apache.org/jira/browse/SPARK-2893 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis Priority: Blocker If for any reason a custom Kryo registrator cannot be found, the task (and application) should fail, because otherwise problems could occur later during data serialisation / deserialisation. See SPARK-2878 for an example. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
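A sketch of the intended behavior change (not the actual patch): when the registrator named by spark.kryo.registrator cannot be loaded or invoked, rethrow instead of logging a warning and continuing with an unregistered Kryo instance. The object and method names are hypothetical.
{code:title=KryoRegistratorFailFastSketch.scala}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkConf, SparkException}
import org.apache.spark.serializer.KryoRegistrator

object KryoRegistratorFailFastSketch {
  def applyUserRegistrator(conf: SparkConf, kryo: Kryo): Unit = {
    conf.getOption("spark.kryo.registrator").foreach { regClassName =>
      try {
        val registrator = Class.forName(regClassName, true,
            Thread.currentThread.getContextClassLoader)
          .newInstance().asInstanceOf[KryoRegistrator]
        registrator.registerClasses(kryo)
      } catch {
        case e: Exception =>
          // Previously this was swallowed with a warning, which leads to the
          // inconsistent serialisation described in SPARK-2878; fail fast instead.
          throw new SparkException(s"Failed to load Kryo registrator $regClassName", e)
      }
    }
  }
}
{code}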
[jira] [Updated] (SPARK-2893) Should not swallow exception when cannot find custom Kryo registrator
[ https://issues.apache.org/jira/browse/SPARK-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2893: --- Assignee: Graham Dennis Should not swallow exception when cannot find custom Kryo registrator - Key: SPARK-2893 URL: https://issues.apache.org/jira/browse/SPARK-2893 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis Assignee: Graham Dennis Priority: Blocker If for any reason a custom Kryo registrator cannot be found, the task (and application) should fail, because otherwise problems could occur later during data serialisation / deserialisation. See SPARK-2878 for an example. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2893) Should not swallow exception when cannot find custom Kryo registrator
[ https://issues.apache.org/jira/browse/SPARK-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2893: --- Target Version/s: 1.1.0, 1.0.3 (was: 1.1.0) Should not swallow exception when cannot find custom Kryo registrator - Key: SPARK-2893 URL: https://issues.apache.org/jira/browse/SPARK-2893 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis Priority: Blocker If for any reason a custom Kryo registrator cannot be found, the task (and application) should fail, because otherwise problems could occur later during data serialisation / deserialisation. See SPARK-2878 for an example. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator
[ https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2878: --- Assignee: Graham Dennis Inconsistent Kryo serialisation with custom Kryo Registrator Key: SPARK-2878 URL: https://issues.apache.org/jira/browse/SPARK-2878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.2 Environment: Linux RedHat EL 6, 4-node Spark cluster. Reporter: Graham Dennis Assignee: Graham Dennis The custom Kryo Registrator (a class with the org.apache.spark.serializer.KryoRegistrator trait) is not used with every Kryo instance created, and this causes inconsistent serialisation and deserialisation. The Kryo Registrator is sometimes not used because of a ClassNotFound exception that only occurs if it *isn't* the Worker thread (of an Executor) that tries to create the KryoRegistrator. A complete description of the problem and a project reproducing the problem can be found at https://github.com/GrahamDennis/spark-kryo-serialisation I have currently only tested this with Spark 1.0.0, but will try to test against 1.0.2. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096608#comment-14096608 ] Saisai Shao edited comment on SPARK-2926 at 8/14/14 7:12 AM: - Hi Matei, I just uploaded a Spark shuffle performance test report. In this report, I choose 3 different workloads (sort-by-key, aggregate-by-key and group-by-key) in SparkPerf to test the performance of current 3 shuffle implementations: hash-based shuffle; sort-based shuffle with HashShuffleReader; sort-based shuffle with sort-merge shuffle reader (our prototype). Generally for sort-by-key our prototype can gain more benefits than other two implementations, while for other two workloads the performance is almost the same. Would you mind taking a look at it, any comment would be greatly appreciated, thanks a lot. was (Author: jerryshao): Hi Matei, I just uploaded a Spark shuffle performance test report. In this report, I choose 3 different workload (sort-by-key, aggregate-by-key and group-by-key) in SparkPerf to test the performance of current 3 shuffle implementations: hash-based shuffle; sort-based shuffle with HashShuffleReader; sort-based shuffle with sort-merge shuffle reader (our prototype). Would you mind taking a look at it, any comment would be greatly appreciated, thanks a lot. Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle -- Key: SPARK-2926 URL: https://issues.apache.org/jira/browse/SPARK-2926 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf Spark has already integrated sort-based shuffle write, which greatly improves the IO performance and reduces the memory consumption when the reducer number is very large. But the reducer side still adopts the hash-based shuffle reader implementation, which neglects the ordering attributes of the map output data in some situations. Here we propose an MR-style sort-merge-like shuffle reader for sort-based shuffle to further improve its performance. Work-in-progress code and a performance test report will be posted later, once some unit test bugs are fixed. Any comments would be greatly appreciated. Thanks a lot. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator
[ https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2878: --- Target Version/s: 1.1.0, 1.0.3 Inconsistent Kryo serialisation with custom Kryo Registrator Key: SPARK-2878 URL: https://issues.apache.org/jira/browse/SPARK-2878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.2 Environment: Linux RedHat EL 6, 4-node Spark cluster. Reporter: Graham Dennis Assignee: Graham Dennis The custom Kryo Registrator (a class with the org.apache.spark.serializer.KryoRegistrator trait) is not used with every Kryo instance created, and this causes inconsistent serialisation and deserialisation. The Kryo Registrator is sometimes not used because of a ClassNotFound exception that only occurs if it *isn't* the Worker thread (of an Executor) that tries to create the KryoRegistrator. A complete description of the problem and a project reproducing the problem can be found at https://github.com/GrahamDennis/spark-kryo-serialisation I have currently only tested this with Spark 1.0.0, but will try to test against 1.0.2. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate
[ https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096689#comment-14096689 ] Patrick Wendell commented on SPARK-3028: I think we intentionally do not intend to ever serialize an update event as JSON. However, I think we should be more explicit about this in two ways. 1. In EventLoggingListener we should explicit put a no-op implementation of onExecutorMetricsUpdate with a comment - I actually thought one version of the PR had this, but maybe I'm misremembering. 2. We should also include the SparkListenerExecutorMetricsUpdate in the match block and just have a no-op there also with a comment. sparkEventToJson should support SparkListenerExecutorMetricsUpdate -- Key: SPARK-3028 URL: https://issues.apache.org/jira/browse/SPARK-3028 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Priority: Blocker SparkListenerExecutorMetricsUpdate was added without updating org.apache.spark.util.JsonProtocol.sparkEventToJson. This can crash the listener. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
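A fragment sketch of suggestion (2) above: make the match in JsonProtocol.sparkEventToJson exhaustive by adding a deliberate no-op case for metrics updates. The surrounding cases are elided and represented by the fallback branch; this is not the real method body.
{code}
import org.json4s.JsonAST.{JNothing, JValue}
import org.apache.spark.scheduler.{SparkListenerEvent, SparkListenerExecutorMetricsUpdate}

def sparkEventToJsonSketch(event: SparkListenerEvent): JValue = event match {
  case _: SparkListenerExecutorMetricsUpdate =>
    // Metrics updates are intentionally never written to the event log,
    // so emit nothing instead of crashing the listener with a MatchError.
    JNothing
  case _ =>
    JNothing // stands in for the existing per-event serialization cases
}
{code}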
[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096688#comment-14096688 ] Saisai Shao commented on SPARK-2926: I think this prototype can easily offer the functionality SPARK-2978 needs. Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle -- Key: SPARK-2926 URL: https://issues.apache.org/jira/browse/SPARK-2926 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf Spark has already integrated sort-based shuffle write, which greatly improves the IO performance and reduces the memory consumption when the reducer number is very large. But the reducer side still adopts the hash-based shuffle reader implementation, which neglects the ordering attributes of the map output data in some situations. Here we propose an MR-style sort-merge-like shuffle reader for sort-based shuffle to further improve its performance. Work-in-progress code and a performance test report will be posted later, once some unit test bugs are fixed. Any comments would be greatly appreciated. Thanks a lot. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate
[ https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096689#comment-14096689 ] Patrick Wendell edited comment on SPARK-3028 at 8/14/14 7:20 AM: - I think we currently do not intend to ever serialize an update event as JSON. However, I think we should be more explicit about this in two ways. 1. In EventLoggingListener we should explicit put a no-op implementation of onExecutorMetricsUpdate with a comment - I actually thought one version of the PR had this, but maybe I'm misremembering. 2. We should also include the SparkListenerExecutorMetricsUpdate in the match block and just have a no-op there also with a comment. was (Author: pwendell): I think we explicitly do not intend to ever serialize an update event as JSON. However, I think we should be more explicit about this in two ways. 1. In EventLoggingListener we should explicit put a no-op implementation of onExecutorMetricsUpdate with a comment - I actually thought one version of the PR had this, but maybe I'm misremembering. 2. We should also include the SparkListenerExecutorMetricsUpdate in the match block and just have a no-op there also with a comment. sparkEventToJson should support SparkListenerExecutorMetricsUpdate -- Key: SPARK-3028 URL: https://issues.apache.org/jira/browse/SPARK-3028 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Priority: Blocker SparkListenerExecutorMetricsUpdate was added without updating org.apache.spark.util.JsonProtocol.sparkEventToJson. This can crash the listener. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate
[ https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096696#comment-14096696 ] Reynold Xin commented on SPARK-3028: Ok, in that case your suggestion makes sense. sparkEventToJson should support SparkListenerExecutorMetricsUpdate -- Key: SPARK-3028 URL: https://issues.apache.org/jira/browse/SPARK-3028 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Priority: Blocker SparkListenerExecutorMetricsUpdate was added without updating org.apache.spark.util.JsonProtocol.sparkEventToJson. This can crash the listener. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xu Zhongxing updated SPARK-3005: Comment: was deleted (was: A related question: why does fined-grain mode and coarse-grained mode perform differently? Neither of them implement the killTask() method.) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask() --- Key: SPARK-3005 URL: https://issues.apache.org/jira/browse/SPARK-3005 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector Reporter: Xu Zhongxing Attachments: SPARK-3005_1.diff I am using Spark, Mesos, spark-cassandra-connector to do some work on a cassandra cluster. During the job running, I killed the Cassandra daemon to simulate some failure cases. This results in task failures. If I run the job in Mesos coarse-grained mode, the spark driver program throws an exception and shutdown cleanly. But when I run the job in Mesos fine-grained mode, the spark driver program hangs. The spark log is: {code} INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 Logging.scala (line 58) Cancelling stage 1 INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 Logging.scala (line 79) Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at
[jira] [Updated] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator
[ https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2878: --- Priority: Critical (was: Major) Inconsistent Kryo serialisation with custom Kryo Registrator Key: SPARK-2878 URL: https://issues.apache.org/jira/browse/SPARK-2878 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.2 Environment: Linux RedHat EL 6, 4-node Spark cluster. Reporter: Graham Dennis Assignee: Graham Dennis Priority: Critical The custom Kryo Registrator (a class with the org.apache.spark.serializer.KryoRegistrator trait) is not used with every Kryo instance created, and this causes inconsistent serialisation and deserialisation. The Kryo Registrator is sometimes not used because of a ClassNotFound exception that only occurs if it *isn't* the Worker thread (of an Executor) that tries to create the KryoRegistrator. A complete description of the problem and a project reproducing the problem can be found at https://github.com/GrahamDennis/spark-kryo-serialisation I have currently only tested this with Spark 1.0.0, but will try to test against 1.0.2. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095288#comment-14095288 ] Xu Zhongxing edited comment on SPARK-3005 at 8/14/14 7:36 AM: -- Some additional driver logs during the spark driver hang: {code} TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,908 Logging.scala (line 66) Checking for newly runnable parent stages INFO [Result resolver thread-1] 2014-08-13 15:58:15,908 Logging.scala (line 58) Removed TaskSet 1.0, whose tasks have all completed, from pool TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 Logging.scala (line 66) running: Set(Stage 1, Stage 2) TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 Logging.scala (line 66) waiting: Set(Stage 0) TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 Logging.scala (line 66) failed: Set() DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 Logging.scala (line 62) submitStage(Stage 0) DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 Logging.scala (line 62) missing: List(Stage 1, Stage 2) DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 Logging.scala (line 62) submitStage(Stage 1) DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 Logging.scala (line 62) submitStage(Stage 2) TRACE [spark-akka.actor.default-dispatcher-3] 2014-08-13 15:58:56,643 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster. TRACE [spark-akka.actor.default-dispatcher-6] 2014-08-13 15:59:56,653 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster. TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 16:00:56,652 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster. {code} was (Author: xuzhongxing): Some additional logs during the spark driver hang: TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,908 Logging.scala (line 66) Checking for newly runnable parent stages INFO [Result resolver thread-1] 2014-08-13 15:58:15,908 Logging.scala (line 58) Removed TaskSet 1.0, whose tasks have all completed, from pool TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 Logging.scala (line 66) running: Set(Stage 1, Stage 2) TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 Logging.scala (line 66) waiting: Set(Stage 0) TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 Logging.scala (line 66) failed: Set() DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,909 Logging.scala (line 62) submitStage(Stage 0) DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 Logging.scala (line 62) missing: List(Stage 1, Stage 2) DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 Logging.scala (line 62) submitStage(Stage 1) DEBUG [spark-akka.actor.default-dispatcher-2] 2014-08-13 15:58:15,910 Logging.scala (line 62) submitStage(Stage 2) TRACE [spark-akka.actor.default-dispatcher-3] 2014-08-13 15:58:56,643 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster. TRACE [spark-akka.actor.default-dispatcher-6] 2014-08-13 15:59:56,653 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster. 
TRACE [spark-akka.actor.default-dispatcher-2] 2014-08-13 16:00:56,652 Logging.scala (line 66) Checking for hosts with no recent heart beats in BlockManagerMaster. Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask() --- Key: SPARK-3005 URL: https://issues.apache.org/jira/browse/SPARK-3005 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector Reporter: Xu Zhongxing Attachments: SPARK-3005_1.diff I am using Spark, Mesos, spark-cassandra-connector to do some work on a cassandra cluster. During the job running, I killed the Cassandra daemon to simulate some failure cases. This results in task failures. If I run the job in Mesos coarse-grained mode, the spark driver program throws an exception and shutdown cleanly. But when I run the job in Mesos fine-grained mode, the spark driver program hangs. The spark log is: {code} INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 Logging.scala (line 58) Cancelling stage 1 INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 Logging.scala (line 79) Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException
[jira] [Comment Edited] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096539#comment-14096539 ] Xu Zhongxing edited comment on SPARK-3005 at 8/14/14 7:37 AM: -- Could adding an empty killTask method to MesosSchedulerBackend fix this problem? {code:title=MesosSchedulerBackend.scala} override def killTask(taskId: Long, executorId: String, interruptThread: Boolean) {} {code} This works for my tests. was (Author: xuzhongxing): Could adding an empty killTask method to MesosSchedulerBackend fix this problem? {code} override def killTask(taskId: Long, executorId: String, interruptThread: Boolean) {} {code} This works for my tests. Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask() --- Key: SPARK-3005 URL: https://issues.apache.org/jira/browse/SPARK-3005 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector Reporter: Xu Zhongxing Attachments: SPARK-3005_1.diff I am using Spark, Mesos, spark-cassandra-connector to do some work on a cassandra cluster. During the job running, I killed the Cassandra daemon to simulate some failure cases. This results in task failures. If I run the job in Mesos coarse-grained mode, the spark driver program throws an exception and shutdown cleanly. But when I run the job in Mesos fine-grained mode, the spark driver program hangs. The spark log is: {code} INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 Logging.scala (line 58) Cancelling stage 1 INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 Logging.scala (line 79) Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at
[jira] [Created] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
Saisai Shao created SPARK-3032: -- Summary: Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, with (String, String) as the key and value data types, we always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the hashcode of the String as the key comparator, breaks the sorting contract. I also tested using Int as the key data type, and that passes the test, since the hashcode of an Int is itself. So I think partitionDiff + hashcode of String may potentially break the sorting contract. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
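For illustration (this is not the Spark code), the sketch below shows how a comparator built from arithmetic on partition ids and hash codes can overflow and become non-transitive, which is exactly the kind of violation TimSort reports, alongside a safer formulation that compares the partition first and the key hash second without subtraction.
{code:title=ComparatorContractSketch.scala}
import java.util.Comparator

object ComparatorContractSketch {
  // Overflow-prone: summing/subtracting Int values can wrap around, so the sign
  // of the result no longer defines a consistent total order.
  val unsafeComparator = new Comparator[(Int, String)] {
    override def compare(a: (Int, String), b: (Int, String)): Int =
      (a._1 - b._1) + (a._2.hashCode - b._2.hashCode)
  }

  // Safer: compare partition ids first, then key hash codes, with no arithmetic
  // that can overflow; this respects the contract TimSort relies on.
  val safeComparator = new Comparator[(Int, String)] {
    override def compare(a: (Int, String), b: (Int, String)): Int = {
      val byPartition = java.lang.Integer.compare(a._1, b._1)
      if (byPartition != 0) byPartition
      else java.lang.Integer.compare(a._2.hashCode, b._2.hashCode)
    }
  }
}
{code}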
[jira] [Commented] (SPARK-3005) Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask()
[ https://issues.apache.org/jira/browse/SPARK-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096725#comment-14096725 ] Apache Spark commented on SPARK-3005: - User 'xuzhongxing' has created a pull request for this issue: https://github.com/apache/spark/pull/1940 Spark with Mesos fine-grained mode throws UnsupportedOperationException in MesosSchedulerBackend.killTask() --- Key: SPARK-3005 URL: https://issues.apache.org/jira/browse/SPARK-3005 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Spark 1.0.2, Mesos 0.18.1, spark-cassandra-connector Reporter: Xu Zhongxing Attachments: SPARK-3005_1.diff I am using Spark, Mesos, spark-cassandra-connector to do some work on a cassandra cluster. During the job running, I killed the Cassandra daemon to simulate some failure cases. This results in task failures. If I run the job in Mesos coarse-grained mode, the spark driver program throws an exception and shutdown cleanly. But when I run the job in Mesos fine-grained mode, the spark driver program hangs. The spark log is: {code} INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,794 Logging.scala (line 58) Cancelling stage 1 INFO [spark-akka.actor.default-dispatcher-4] 2014-08-13 15:58:15,797 Logging.scala (line 79) Could not cancel tasks for stage 1 java.lang.UnsupportedOperationException at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32) at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1075) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1061) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1061) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at
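The hang happens because the base SchedulerBackend.killTask simply throws UnsupportedOperationException, so cancelTasks aborts instead of cleaning up the running Mesos tasks. Below is a minimal, hedged sketch of what a Mesos-side kill could look like, assuming the Mesos Java API (org.apache.mesos.SchedulerDriver and Protos.TaskID); this is illustrative and not the actual change in the pull request above.
{code}
import org.apache.mesos.Protos.TaskID
import org.apache.mesos.SchedulerDriver

// Hypothetical sketch: forward a Spark task kill to the Mesos driver.
// `driver` and the Long task id would come from the surrounding scheduler backend.
def killTask(driver: SchedulerDriver, taskId: Long): Unit = {
  val mesosTaskId = TaskID.newBuilder().setValue(taskId.toString).build()
  driver.killTask(mesosTaskId) // Mesos reports TASK_KILLED status updates asynchronously
}
{code}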
[jira] [Resolved] (SPARK-3029) Disable local execution of Spark jobs by default
[ https://issues.apache.org/jira/browse/SPARK-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3029. Resolution: Fixed Fix Version/s: 1.1.0 Disable local execution of Spark jobs by default Key: SPARK-3029 URL: https://issues.apache.org/jira/browse/SPARK-3029 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker Fix For: 1.1.0 Currently, local execution of Spark jobs is only used by take(), and it can be problematic because it can load a significant amount of data onto the driver. The worst-case scenarios occur if the RDD is cached (guaranteed to load the whole partition), has very large elements, or the partition is simply large and we apply a filter with high selectivity or computational overhead. Additionally, jobs that run locally in this manner do not show up in the web UI, which makes them harder to track and understand. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
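The resolution leaves local execution off unless it is explicitly re-enabled through a configuration flag. A minimal sketch of opting back in, assuming the flag is named spark.localExecution.enabled (the property name is an assumption; check the configuration docs for your Spark version):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: re-enable driver-local execution of small jobs such as take().
val conf = new SparkConf()
  .setAppName("local-execution-example")
  .setMaster("local[*]")
  .set("spark.localExecution.enabled", "true") // assumed property name; default is off after this fix
val sc = new SparkContext(conf)
{code}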
[jira] [Updated] (SPARK-3033) java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[ https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-3033: --- Description: run a complex HiveQL via yarn-cluster, got error as below: {quote} 14/08/14 15:05:24 WARN org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to java.lang.ClassCastException java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} was: 14/08/14 15:05:24 WARN org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to java.lang.ClassCastException java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306) at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at
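The cast fails because Spark hands Hive's JavaHiveDecimalObjectInspector a plain java.math.BigDecimal where a HiveDecimal is expected. A hedged sketch of the kind of wrapping that avoids the ClassCastException, assuming Hive's HiveDecimal.create factory (Hive 0.13+) is available on the classpath; this is illustrative, not the actual fix:
{code}
import java.math.BigDecimal
import org.apache.hadoop.hive.common.type.HiveDecimal

// Illustrative only: wrap a BigDecimal before handing it to Hive object inspectors.
def toHiveDecimal(value: BigDecimal): HiveDecimal = HiveDecimal.create(value)

val wrapped = toHiveDecimal(new BigDecimal("123.45"))
{code}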
[jira] [Created] (SPARK-3034) java.sql.Date cannot be cast to java.sql.Timestamp
pengyanhong created SPARK-3034: -- Summary: java.sql.Date cannot be cast to java.sql.Timestamp Key: SPARK-3034 URL: https://issues.apache.org/jira/browse/SPARK-3034 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.2 Reporter: pengyanhong Priority: Blocker Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date cannot be cast to java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33) org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Updated] (SPARK-3034) java.sql.Date cannot be cast to java.sql.Timestamp
[ https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-3034: --- Description: run a simple HiveQL via yarn-cluster, got error as below: {quote} Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date cannot be cast to java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33) org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) was: Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at
[jira] [Updated] (SPARK-3034) java.sql.Date cannot be cast to java.sql.Timestamp
[ https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-3034: --- Description: run a simple HiveQL via yarn-cluster, got error as below: {quote} Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date cannot be cast to java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33) org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {quote} above error is thrown in the stage of inserting result data into a hive table which has a field of timestamp data type. was: run a simple HiveQL via yarn-cluster, got error as below: {quote} Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
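The failure occurs while serializing rows into a Hive table whose column is typed timestamp but whose value is a java.sql.Date. A minimal, hedged sketch of the conversion the insert path needs to perform (illustrative; not the actual fix):
{code}
import java.sql.{Date, Timestamp}

// Convert a java.sql.Date into the java.sql.Timestamp that Hive's timestamp inspector expects.
def dateToTimestamp(d: Date): Timestamp = new Timestamp(d.getTime)

val ts = dateToTimestamp(Date.valueOf("2014-08-14"))
{code}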
[jira] [Created] (SPARK-3035) Wrong example with SparkContext.addFile
Daehan Kim created SPARK-3035: - Summary: Wrong example with SparkContext.addFile Key: SPARK-3035 URL: https://issues.apache.org/jira/browse/SPARK-3035 Project: Spark Issue Type: Documentation Components: PySpark Affects Versions: 1.0.2 Reporter: Daehan Kim Priority: Trivial Fix For: 1.0.2 {code:title=context.py} def addFile(self, path): ... from pyspark import SparkFiles path = os.path.join(tempdir, "test.txt") with open(path, "w") as testFile: ...testFile.write("100") sc.addFile(path) def func(iterator): ...with open(SparkFiles.get("test.txt")) as testFile: ...fileVal = int(testFile.readline()) ...return [x * 100 for x in iterator] sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect() [100, 200, 300, 400] {code} This example writes 100 to a temp file, distributes it, and uses its value when multiplying values (to check that nodes can read the distributed file). But look at these lines; the result will never be affected by the distributed file: {code} ...fileVal = int(testFile.readline()) ...return [x * 100 for x in iterator] {code} I'm sure this code was intended to be: {code} ...fileVal = int(testFile.readline()) ...return [x * fileVal for x in iterator] {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2893) Should not swallow exception when cannot find custom Kryo registrator
[ https://issues.apache.org/jira/browse/SPARK-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2893. Resolution: Fixed Fix Version/s: 1.0.3 1.1.0 Should not swallow exception when cannot find custom Kryo registrator - Key: SPARK-2893 URL: https://issues.apache.org/jira/browse/SPARK-2893 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis Assignee: Graham Dennis Priority: Blocker Fix For: 1.1.0, 1.0.3 If for any reason a custom Kryo registrator cannot be found, the task ( application) should fail, because otherwise problems could occur later during data serialisation / deserialisation. See SPARK-2878 for an example. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3036) Add MapType containing null value support to Parquet.
Takuya Ueshin created SPARK-3036: Summary: Add MapType containing null value support to Parquet. Key: SPARK-3036 URL: https://issues.apache.org/jira/browse/SPARK-3036 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Priority: Blocker The current Parquet schema for {{MapType}} is as follows regardless of {{valueContainsNull}}: {noformat} message root { optional group a (MAP) { repeated group map (MAP_KEY_VALUE) { required int32 key; required int32 value; } } } {noformat} and if the map contains a {{null}} value, it throws a runtime exception. To handle a {{MapType}} containing {{null}} values, the schema should be as follows if {{valueContainsNull}} is {{true}}: {noformat} message root { optional group a (MAP) { repeated group map (MAP_KEY_VALUE) { required int32 key; optional int32 value; } } } {noformat} FYI: Hive's Parquet writer *always* uses the latter schema, but its reader can read both schemas. NOTICE: This change will break backward compatibility when the schema is read from Parquet metadata ({{org.apache.spark.sql.parquet.row.metadata}}). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
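In Spark SQL, whether a map's values may be null is carried by MapType.valueContainsNull, which should drive whether the Parquet value field is written as required or optional. A small sketch of constructing such a schema; the package path shown (org.apache.spark.sql.types) is the modern location and may differ in the 1.1-era code this issue targets:
{code}
import org.apache.spark.sql.types.{IntegerType, MapType, StructField, StructType}

// A map column whose values may be null: the Parquet value field should be written as `optional`.
val schema = StructType(Seq(
  StructField("a", MapType(IntegerType, IntegerType, valueContainsNull = true), nullable = true)
))
{code}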
[jira] [Created] (SPARK-3038) delete history server logs when there are too many logs
wangfei created SPARK-3038: -- Summary: delete history server logs when there are too many logs Key: SPARK-3038 URL: https://issues.apache.org/jira/browse/SPARK-3038 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: wangfei Fix For: 1.1.0 Enhance the history server to delete logs automatically: 1. use spark.history.deletelogs.enable to enable this function; 2. when the number of application logs is greater than spark.history.maxsavedapplication, delete the old logs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3038) delete history server logs when there are too many logs
[ https://issues.apache.org/jira/browse/SPARK-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3038: --- Description: enhance history server to delete logs automatically 1 use spark.history.deletelogs.enable to enable this function 2 when app logs num is greater than spark.history.maxsavedapplication, delete the older logs was: enhance history server to delete logs automatically 1 use spark.history.deletelogs.enable to enable this function 2 when app logs num is greater than spark.history.maxsavedapplication, delete the old logs delete history server logs when there are too many logs Key: SPARK-3038 URL: https://issues.apache.org/jira/browse/SPARK-3038 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: wangfei Fix For: 1.1.0 enhance history server to delete logs automatically 1 use spark.history.deletelogs.enable to enable this function 2 when app logs num is greater than spark.history.maxsavedapplication, delete the older logs -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
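The proposal amounts to capping the number of retained event log directories and removing the oldest ones first. A purely hypothetical sketch of that cleanup policy (the property names spark.history.deletelogs.enable and spark.history.maxsavedapplication come from the issue; the file handling here is illustrative, not the history server's actual code):
{code}
import java.io.File

// Keep at most `maxSavedApplications` log directories; delete the oldest beyond that limit.
def cleanOldLogs(logRoot: File, maxSavedApplications: Int): Unit = {
  val logs = Option(logRoot.listFiles()).getOrElse(Array.empty[File]).sortBy(_.lastModified())
  if (logs.length > maxSavedApplications) {
    logs.take(logs.length - maxSavedApplications).foreach { dir =>
      // recursive delete for a directory of event logs
      def delete(f: File): Unit = {
        Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(delete)
        f.delete()
      }
      delete(dir)
    }
  }
}
{code}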
[jira] [Created] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
Bertrand Bossy created SPARK-3039: - Summary: Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 1.0.0, 0.9.1 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using {code} sc.newAPIHadoopFile[AvroKey[SomeClass]],NullWritable,AvroKeyInputFormat[SomeClass]](hdfs://path/to/file.avro) {code} The following error occurs: {code} java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} This error usually is a hint that there was a mix up of the old and the new Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
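The work-around of forcing the hadoop2 flavor onto the classpath can be expressed as an explicit build dependency with the hadoop2 classifier. A hedged sbt sketch (the Avro version number is an assumption; use whatever matches your build), followed by a cleaned-up form of the read shown above, whose inline snippet appears to have lost its quotes and a bracket during formatting (SomeClass stands in for an Avro-generated class, and sc is an existing SparkContext):
{code}
// build.sbt: pull in avro-mapred built against the new (mapreduce) Hadoop API.
libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
{code}
{code}
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Read an Avro file through the new Hadoop API input format.
val records = sc.newAPIHadoopFile[AvroKey[SomeClass], NullWritable, AvroKeyInputFormat[SomeClass]](
  "hdfs://path/to/file.avro")
{code}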
[jira] [Commented] (SPARK-3026) Provide a good error message if JDBC server is used but Spark is not compiled with -Pthriftserver
[ https://issues.apache.org/jira/browse/SPARK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096961#comment-14096961 ] Apache Spark commented on SPARK-3026: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/1944 Provide a good error message if JDBC server is used but Spark is not compiled with -Pthriftserver - Key: SPARK-3026 URL: https://issues.apache.org/jira/browse/SPARK-3026 Project: Spark Issue Type: Bug Components: SQL Reporter: Patrick Wendell Assignee: Cheng Lian Priority: Critical Instead of giving a ClassNotFoundException we should detect this case and just tell the user to build with -Phiveserver. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bertrand Bossy updated SPARK-3039: -- Affects Version/s: 1.1.0 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using {code} sc.newAPIHadoopFile[AvroKey[SomeClass]],NullWritable,AvroKeyInputFormat[SomeClass]](hdfs://path/to/file.avro) {code} The following error occurs: {code} java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} This error usually is a hint that there was a mix up of the old and the new Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3034) [HIve] java.sql.Date cannot be cast to java.sql.Timestamp
[ https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-3034: --- Summary: [HIve] java.sql.Date cannot be cast to java.sql.Timestamp (was: java.sql.Date cannot be cast to java.sql.Timestamp) [HIve] java.sql.Date cannot be cast to java.sql.Timestamp - Key: SPARK-3034 URL: https://issues.apache.org/jira/browse/SPARK-3034 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.2 Reporter: pengyanhong Priority: Blocker run a simple HiveQL via yarn-cluster, got error as below: {quote} Exception in thread Thread-2 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: java.sql.Date cannot be cast to java.sql.Timestamp org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33) org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439) org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[jira] [Updated] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal
[ https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengyanhong updated SPARK-3033: --- Summary: [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal (was: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal Key: SPARK-3033 URL: https://issues.apache.org/jira/browse/SPARK-3033 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.2 Reporter: pengyanhong Priority: Blocker run a complex HiveQL via yarn-cluster, got error as below: {quote} 14/08/14 15:05:24 WARN org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to java.lang.ClassCastException java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022) at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179) at org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82) at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62) at org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309) at org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API
[ https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14096973#comment-14096973 ] Apache Spark commented on SPARK-3039: - User 'bbossy' has created a pull request for this issue: https://github.com/apache/spark/pull/1945 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API -- Key: SPARK-3039 URL: https://issues.apache.org/jira/browse/SPARK-3039 Project: Spark Issue Type: Bug Components: Build, Input/Output, Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0 Environment: hadoop2, hadoop-2.4.0, HDP-2.1 Reporter: Bertrand Bossy The spark assembly contains the artifact org.apache.avro:avro-mapred as a dependency of org.spark-project.hive:hive-serde. The avro-mapred package provides a hadoop FileInputFormat to read and write avro files. There are two versions of this package, distinguished by a classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. avro-mapred for the old Hadoop API uses no classifier. E.g. when reading avro files using {code} sc.newAPIHadoopFile[AvroKey[SomeClass]],NullWritable,AvroKeyInputFormat[SomeClass]](hdfs://path/to/file.avro) {code} The following error occurs: {code} java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:111) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} This error usually is a hint that there was a mix up of the old and the new Hadoop API. As a work-around, if avro-mapred for hadoop2 is forced to appear before the version that is bundled with Spark, reading avro files works fine. Also, if Spark is built using avro-mapred for hadoop2, it works fine as well. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3040) pick up a more proper local ip address for Utils.findLocalIpAddress method
Ye Xianjin created SPARK-3040: - Summary: pick up a more proper local ip address for Utils.findLocalIpAddress method Key: SPARK-3040 URL: https://issues.apache.org/jira/browse/SPARK-3040 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Environment: Mac os x, a bunch of network interfaces: eth0, wlan0, vnic0, vnic1, tun0, lo Reporter: Ye Xianjin Priority: Trivial I noticed this inconvenience when I ran spark-shell with my virtual machines and a VPN service running. There are a lot of network interfaces on my laptop (inactive devices omitted): {quote} lo0: inet 127.0.0.1 en1: inet 192.168.0.102 vnic0: inet 10.211.55.2 (virtual if for vm1) vnic1: inet 10.37.129.3 (virtual if for vm2) tun0: inet 172.16.100.191 -- 172.16.100.191 (tun device for VPN) {quote} In Spark core, Utils.findLocalIpAddress() uses NetworkInterface.getNetworkInterfaces to get all active network interfaces, but unfortunately this method returns network interfaces in reverse order compared to the ifconfig output (both use the ioctl sys call). I dug into the OpenJDK 6 and 7 source code and confirmed this behavior (it only happens on Unix-like systems; Windows handles it and returns them in index order). So the findLocalIpAddress method will pick the IP address associated with tun0 rather than en1. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
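The gist is that java.net.NetworkInterface.getNetworkInterfaces returns interfaces in the reverse of ifconfig order on Unix-like systems, so the first non-loopback address found may belong to a tun or vnic device rather than the real NIC. A hedged sketch of scanning the interfaces in ifconfig order and taking the first usable IPv4 address (illustrative; not the actual Utils code or the pull request's change):
{code}
import java.net.{Inet4Address, InetAddress, NetworkInterface}
import scala.collection.JavaConverters._

// getNetworkInterfaces comes back in reverse ifconfig order on Unix-like systems,
// so reverse it before scanning for the first non-loopback IPv4 address.
def pickLocalIp(): InetAddress = {
  val interfaces = NetworkInterface.getNetworkInterfaces.asScala.toSeq.reverse
  val candidates = for {
    ni   <- interfaces
    addr <- ni.getInetAddresses.asScala.toSeq
    if !addr.isLoopbackAddress && addr.isInstanceOf[Inet4Address]
  } yield addr
  candidates.headOption.getOrElse(InetAddress.getLocalHost)
}
{code}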
[jira] [Commented] (SPARK-3040) pick up a more proper local ip address for Utils.findLocalIpAddress method
[ https://issues.apache.org/jira/browse/SPARK-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097115#comment-14097115 ] Apache Spark commented on SPARK-3040: - User 'advancedxy' has created a pull request for this issue: https://github.com/apache/spark/pull/1946 pick up a more proper local ip address for Utils.findLocalIpAddress method -- Key: SPARK-3040 URL: https://issues.apache.org/jira/browse/SPARK-3040 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Environment: Mac os x, a bunch of network interfaces: eth0, wlan0, vnic0, vnic1, tun0, lo Reporter: Ye Xianjin Priority: Trivial Original Estimate: 1h Remaining Estimate: 1h I noticed this inconvenience when I ran spark-shell with my virtual machines on and VPN service running. There are a lot of network interfaces on my laptop(inactive devices omitted): {quote} lo0: inet 127.0.0.1 en1: inet 192.168.0.102 vnic0: inet 10.211.55.2 (virtual if for vm1) vnic1: inet 10.37.129.3 (virtual if for vm2) tun0: inet 172.16.100.191 -- 172.16.100.191 (tun device for VPN) {quote} In spark core, Utils.findLocalIpAddress() uses NetworkInterface.getNetworkInterfaces to get all active network interfaces, but unfortunately, this method returns network interfaces in reverse order compared to the ifconfig output (both use ioctl sys call). I dug into the openJDK 6 and 7 source code and confirms this behavior(It just happens on unix-like system, windows deals with it and returns in index order). So, the findLocalIpAddress method will pick the ip address associated with tun0 rather than en1 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate
[ https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097193#comment-14097193 ] Sandy Ryza commented on SPARK-3028: --- +1 to what Patrick said. I'll post a patch along those lines. sparkEventToJson should support SparkListenerExecutorMetricsUpdate -- Key: SPARK-3028 URL: https://issues.apache.org/jira/browse/SPARK-3028 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin Priority: Blocker SparkListenerExecutorMetricsUpdate was added without updating org.apache.spark.util.JsonProtocol.sparkEventToJson. This can crash the listener. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
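The crash is the classic non-exhaustive-match problem: a new event type was added but the JSON conversion's pattern match was not extended, so the listener dies with a MatchError. A hypothetical, simplified sketch of the defensive shape (toy event classes, not Spark's actual JsonProtocol code): either add a case for the new event or degrade gracefully on events that are not meant to be logged.
{code}
// Simplified stand-ins for listener events; not Spark's real classes.
sealed trait Event
case class JobStart(jobId: Int) extends Event
case class ExecutorMetricsUpdate(execId: String) extends Event

def eventToJson(event: Event): Option[String] = event match {
  case JobStart(id)             => Some(s"""{"Event":"JobStart","Job ID":$id}""")
  case ExecutorMetricsUpdate(_) => None // explicitly skipped rather than crashing
  // a catch-all case would likewise prevent a MatchError if another event type is added later
}
{code}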
[jira] [Commented] (SPARK-3009) ApplicationInfo doesn't get initialised after deserialisation during recovery
[ https://issues.apache.org/jira/browse/SPARK-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097268#comment-14097268 ] Apache Spark commented on SPARK-3009: - User 'jacek-lewandowski' has created a pull request for this issue: https://github.com/apache/spark/pull/1947 ApplicationInfo doesn't get initialised after deserialisation during recovery - Key: SPARK-3009 URL: https://issues.apache.org/jira/browse/SPARK-3009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Jacek Lewandowski The {{readObject}} method has been removed from {{ApplicationInfo}} so that it does not initialise its transient fields properly after deserialisation. It follows throwing NPE during recovery of an application in {{MetricSystem.registerSource}}. As [~andrewor14] said, he removed {{readObject}} method by accident. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2927) Add a conf to configure if we always read Binary columns stored in Parquet as String columns
[ https://issues.apache.org/jira/browse/SPARK-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2927. - Resolution: Fixed Fix Version/s: 1.1.0 Add a conf to configure if we always read Binary columns stored in Parquet as String columns Key: SPARK-2927 URL: https://issues.apache.org/jira/browse/SPARK-2927 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.1.0 Based on Parquet spec (https://github.com/Parquet/parquet-format), strings are stored as byte arrays (binary) with a UTF8 annotation. However, if the data generator does not follow it, we will only read binary values back instead of string values. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
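The resolution adds a SQL configuration flag that tells the Parquet reader to interpret un-annotated binary columns as strings. A hedged sketch of turning it on (the property name spark.sql.parquet.binaryAsString and the 1.1-era SQLContext API are my best understanding, not a quotation of the patch):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("binary-as-string").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Assumed property name: treat un-annotated Parquet binary columns as UTF8 strings.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
val data = sqlContext.parquetFile("/path/to/strings-as-binary.parquet") // placeholder path
{code}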
[jira] [Resolved] (SPARK-3011) _temporary directory should be filtered out by sqlContext.parquetFile
[ https://issues.apache.org/jira/browse/SPARK-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3011. - Resolution: Fixed Fix Version/s: 1.1.0 _temporary directory should be filtered out by sqlContext.parquetFile - Key: SPARK-3011 URL: https://issues.apache.org/jira/browse/SPARK-3011 Project: Spark Issue Type: Bug Components: SQL Reporter: Joseph Su Fix For: 1.1.0 Sometimes _temporary directory is not removed after the file committed on S3. sqlContext.parquetFile will raise because it is trying to read the metadata in _temporary .sqlContext.parquetFile should just ignore the directory. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
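One way to express the proposed behavior is to skip any child path whose name starts with an underscore (such as _temporary or other side files) before handing the metadata to the Parquet reader. A hedged sketch using the Hadoop FileSystem API (illustrative; not the actual Spark SQL change):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}

// List the data files under a Parquet output directory, ignoring _temporary and other _-prefixed entries.
def listDataFiles(dir: String): Array[Path] = {
  val path = new Path(dir)
  val fs = path.getFileSystem(new Configuration())
  val filter = new PathFilter {
    override def accept(p: Path): Boolean = !p.getName.startsWith("_")
  }
  fs.listStatus(path, filter).map(_.getPath)
}
{code}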
[jira] [Created] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect
Joseph K. Bradley created SPARK-3041: Summary: DecisionTree: isSampleValid indexing incorrect Key: SPARK-3041 URL: https://issues.apache.org/jira/browse/SPARK-3041 Project: Spark Issue Type: Bug Components: MLlib Reporter: Joseph K. Bradley In DecisionTree, isSampleValid treats unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories. This bug is exhibited for unordered features (multi-class classification with categorical features of low arity). Proposed fix: Index bins correctly for unordered categorical features. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
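For an unordered categorical feature, each candidate split is a subset of the categories, so bins must be indexed by subset membership rather than by the raw category value. A hypothetical sketch of the subset-membership test, with subsets encoded as bitmasks (illustrative only; not the DecisionTree code):
{code}
// Encode a candidate split of an unordered categorical feature as a bitmask over categories.
// A sample goes to the "left" side of split `subsetMask` iff its category's bit is set.
def inSubset(category: Int, subsetMask: Int): Boolean = ((subsetMask >> category) & 1) == 1

// For a feature of arity M there are 2^(M-1) - 1 distinct non-trivial subsets to consider.
def numUnorderedSplits(arity: Int): Int = (1 << (arity - 1)) - 1
{code}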
[jira] [Created] (SPARK-3043) DecisionTree aggregation is inefficient
Joseph K. Bradley created SPARK-3043: Summary: DecisionTree aggregation is inefficient Key: SPARK-3043 URL: https://issues.apache.org/jira/browse/SPARK-3043 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley 2 major efficiency issues in computation and storage: (1) DecisionTree aggregation involves reshaping data unnecessarily. E.g., the internal methods extractNodeInfo() and getBinDataForNode() involve reshaping the data multiple times without real computation. (2) DecisionTree splits and aggregate bins can include many unused bins/splits. The same number of splits/bins are used for all features. E.g., if there is a continuous feature which uses 100 bins, then there will also be 100 bins allocated for all binary features, even though only 2 are necessary. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
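Point (2) is essentially a storage-layout problem: if every feature is allocated the global maximum number of bins, binary features waste most of their slots. A hypothetical sketch of a flat aggregate buffer with per-feature offsets so each feature only gets the bins it needs (illustrative, not the MLlib implementation):
{code}
// Per-feature bin counts, e.g. a 100-bin continuous feature next to two binary features.
val numBinsPerFeature = Array(100, 2, 2)

// Offsets into one flat statistics buffer: feature f's bins live at [offsets(f), offsets(f + 1)).
val offsets = numBinsPerFeature.scanLeft(0)(_ + _)
val flatStats = new Array[Double](offsets.last) // 104 slots instead of 3 * 100

def binIndex(feature: Int, bin: Int): Int = offsets(feature) + bin
{code}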
[jira] [Updated] (SPARK-3042) DecisionTree filtering is very inefficient
[ https://issues.apache.org/jira/browse/SPARK-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3042: - Assignee: Joseph K. Bradley DecisionTree filtering is very inefficient -- Key: SPARK-3042 URL: https://issues.apache.org/jira/browse/SPARK-3042 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley DecisionTree needs to match each example to a node at each iteration. It currently does this with a set of filters very inefficiently: For each example, it examines each node at the current level and traces up to the root to see if that example should be handled by that node. Proposed fix: Filter top-down using the partly built tree itself. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
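The proposed fix replaces the per-node filter chain with a single top-down pass: each example is routed through the partly built tree to find the node that should handle it. A hypothetical, simplified sketch of that routing (toy node and split classes, not MLlib's):
{code}
// Toy model: internal nodes threshold on one continuous feature; leaves carry an id.
sealed trait Node { def id: Int }
case class Leaf(id: Int) extends Node
case class Internal(id: Int, feature: Int, threshold: Double, left: Node, right: Node) extends Node

// Route an example to the node that should handle it at the current level of training.
@annotation.tailrec
def nodeIndexFor(node: Node, features: Array[Double]): Int = node match {
  case Leaf(id) => id
  case Internal(_, f, t, l, r) =>
    if (features(f) <= t) nodeIndexFor(l, features) else nodeIndexFor(r, features)
}
{code}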
[jira] [Updated] (SPARK-3043) DecisionTree aggregation is inefficient
[ https://issues.apache.org/jira/browse/SPARK-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3043: - Assignee: Joseph K. Bradley DecisionTree aggregation is inefficient --- Key: SPARK-3043 URL: https://issues.apache.org/jira/browse/SPARK-3043 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 major efficiency issues in computation and storage: (1) DecisionTree aggregation involves reshaping data unnecessarily. E.g., the internal methods extractNodeInfo() and getBinDataForNode() involve reshaping the data multiple times without real computation. (2) DecisionTree splits and aggregate bins can include many unused bins/splits. The same number of splits/bins are used for all features. E.g., if there is a continuous feature which uses 100 bins, then there will also be 100 bins allocated for all binary features, even though only 2 are necessary. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3042) DecisionTree filtering is very inefficient
[ https://issues.apache.org/jira/browse/SPARK-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3042: - Target Version/s: 1.1.0 DecisionTree filtering is very inefficient -- Key: SPARK-3042 URL: https://issues.apache.org/jira/browse/SPARK-3042 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley DecisionTree needs to match each example to a node at each iteration. It currently does this with a set of filters very inefficiently: For each example, it examines each node at the current level and traces up to the root to see if that example should be handled by that node. Proposed fix: Filter top-down using the partly built tree itself. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect
[ https://issues.apache.org/jira/browse/SPARK-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3041: - Assignee: Joseph K. Bradley DecisionTree: isSampleValid indexing incorrect -- Key: SPARK-3041 URL: https://issues.apache.org/jira/browse/SPARK-3041 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley In DecisionTree, isSampleValid treats unordered categorical features incorrectly: it treats the bins as if they were indexed by feature values, rather than by subsets of values/categories. This bug is exhibited for unordered features (multi-class classification with categorical features of low arity). Proposed fix: Index bins correctly for unordered categorical features. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3043) DecisionTree aggregation is inefficient
[ https://issues.apache.org/jira/browse/SPARK-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3043: - Target Version/s: 1.1.0 Affects Version/s: 1.1.0 DecisionTree aggregation is inefficient --- Key: SPARK-3043 URL: https://issues.apache.org/jira/browse/SPARK-3043 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 major efficiency issues in computation and storage: (1) DecisionTree aggregation involves reshaping data unnecessarily. E.g., the internal methods extractNodeInfo() and getBinDataForNode() involve reshaping the data multiple times without real computation. (2) DecisionTree splits and aggregate bins can include many unused bins/splits. The same number of splits/bins are used for all features. E.g., if there is a continuous feature which uses 100 bins, then there will also be 100 bins allocated for all binary features, even though only 2 are necessary. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect
[ https://issues.apache.org/jira/browse/SPARK-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3041: - Target Version/s: 1.1.0 Affects Version/s: 1.1.0 DecisionTree: isSampleValid indexing incorrect -- Key: SPARK-3041 URL: https://issues.apache.org/jira/browse/SPARK-3041 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley In DecisionTree, isSampleValid treats unordered categorical features incorrectly: it treats the bins as if they were indexed by feature values, rather than by subsets of values/categories. This bug is exhibited for unordered features (multi-class classification with categorical features of low arity). Proposed fix: Index bins correctly for unordered categorical features. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3044) Create RSS feed for Spark News
Nicholas Chammas created SPARK-3044: --- Summary: Create RSS feed for Spark News Key: SPARK-3044 URL: https://issues.apache.org/jira/browse/SPARK-3044 Project: Spark Issue Type: Documentation Reporter: Nicholas Chammas Priority: Minor Project updates are often posted here: http://spark.apache.org/news/ Currently, there is no way to subscribe to a feed of these updates. It would be nice if there were a way for people to be notified of new posts there without having to check manually. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2979) Improve the convergence rate by minimizing the condition number in LOR with LBFGS
[ https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2979: - Assignee: DB Tsai Improve the convergence rate by minimizing the condition number in LOR with LBFGS - Key: SPARK-2979 URL: https://issues.apache.org/jira/browse/SPARK-2979 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Scaling to minimize the condition number: During the optimization process, the convergence (rate) depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets mixing columns with different scales may not be able to converge. The GLMNET and LIBSVM packages perform this scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf Here, if useFeatureScaling is enabled, we will standardize the training features by dividing by the variance of each column (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space back to the original scale, as GLMNET and LIBSVM do. Currently, it's only enabled in LogisticRegressionWithLBFGS. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
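A minimal Scala sketch of the scaling idea (this is not the MLlib implementation; it only illustrates dividing each column by its standard deviation without centering, and mapping the learned weights back to the original scale):
{code}
// Train in a scaled space z = x / s (s = per-column standard deviation, mean not
// subtracted so sparsity is preserved), then return weights for the original
// features: if w' fits the scaled data, w'.(x / s) == (w' / s).x, so w = w' / s.
object FeatureScalingSketch {
  def columnStd(data: Seq[Array[Double]]): Array[Double] = {
    val n    = data.size.toDouble
    val dim  = data.head.length
    val mean = Array.tabulate(dim)(j => data.map(_(j)).sum / n)
    Array.tabulate(dim)(j => math.sqrt(data.map(r => math.pow(r(j) - mean(j), 2)).sum / n))
  }

  def scaleRow(row: Array[Double], std: Array[Double]): Array[Double] =
    row.zip(std).map { case (x, s) => if (s == 0.0) x else x / s }

  def unscaleWeights(w: Array[Double], std: Array[Double]): Array[Double] =
    w.zip(std).map { case (wi, s) => if (s == 0.0) wi else wi / s }

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, 100.0), Array(2.0, 300.0), Array(3.0, 500.0))
    val std  = columnStd(data)
    println(scaleRow(data.head, std).mkString(", "))  // both columns now on a comparable scale
  }
}
{code}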
[jira] [Resolved] (SPARK-2979) Improve the convergence rate by minimizing the condition number in LOR with LBFGS
[ https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2979. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1897 [https://github.com/apache/spark/pull/1897] Improve the convergence rate by minimizing the condition number in LOR with LBFGS - Key: SPARK-2979 URL: https://issues.apache.org/jira/browse/SPARK-2979 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.1.0 Scaling to minimize the condition number: During the optimization process, the convergence (rate) depends on the condition number of the training dataset. Scaling the variables often reduces this condition number, thus improving the convergence rate dramatically. Without reducing the condition number, some training datasets mixing columns with different scales may not be able to converge. The GLMNET and LIBSVM packages perform this scaling to reduce the condition number, and return the weights in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf Here, if useFeatureScaling is enabled, we will standardize the training features by dividing by the variance of each column (without subtracting the mean), and train the model in the scaled space. Then we transform the coefficients from the scaled space back to the original scale, as GLMNET and LIBSVM do. Currently, it's only enabled in LogisticRegressionWithLBFGS. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3046) Set executor's class loader as the default serializer class loader
Reynold Xin created SPARK-3046: -- Summary: Set executor's class loader as the default serializer class loader Key: SPARK-3046 URL: https://issues.apache.org/jira/browse/SPARK-3046 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin This is an attempt to fix the problem outlined in SPARK-2878. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
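For background on why a deserializer needs the executor's class loader, here is a generic Java-serialization sketch in Scala (illustrative only; this is not Spark's code): a stream that resolves classes against an explicit loader can see classes from the application jar, while a plain ObjectInputStream may not.
{code}
import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

// An ObjectInputStream that resolves classes with a caller-supplied class loader
// (e.g. the executor's URLClassLoader holding the user's jars) instead of the
// default loader picked from the call stack.
class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
    extends ObjectInputStream(in) {
  override protected def resolveClass(desc: ObjectStreamClass): Class[_] =
    Class.forName(desc.getName, false, loader)
}

object DeserializeWithLoader {
  def deserialize[T](bytes: Array[Byte], loader: ClassLoader): T = {
    val ois = new LoaderAwareObjectInputStream(new ByteArrayInputStream(bytes), loader)
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}
{code}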
[jira] [Created] (SPARK-3045) Make Serializer interface Java friendly
Reynold Xin created SPARK-3045: -- Summary: Make Serializer interface Java friendly Key: SPARK-3045 URL: https://issues.apache.org/jira/browse/SPARK-3045 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin The various Scheduler* traits have default implementations in them, which makes them hard to extend in Java. We should turn those traits into AbstractClass instead. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
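The trait-versus-abstract-class point can be illustrated generically (the types below are invented and are not the actual Serializer or Scheduler APIs): a Scala trait with concrete members does not map onto a plain Java interface, so a Java subclass cannot simply inherit those implementations, whereas an abstract class extends cleanly from Java.
{code}
// Hard to extend from Java: the concrete method lives in trait infrastructure,
// not in something a Java class can inherit directly (pre-Java-8 default methods).
trait WithDefaults {
  def name: String
  def describe(): String = s"component: $name"
}

// Straightforward from Java: `class MyImpl extends JavaFriendly { ... }` just works,
// and describe() is inherited as a normal method.
abstract class JavaFriendly {
  def name: String
  def describe(): String = s"component: $name"
}
{code}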
[jira] [Commented] (SPARK-3045) Make Serializer interface Java friendly
[ https://issues.apache.org/jira/browse/SPARK-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097503#comment-14097503 ] Apache Spark commented on SPARK-3045: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1948 Make Serializer interface Java friendly --- Key: SPARK-3045 URL: https://issues.apache.org/jira/browse/SPARK-3045 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin The various Scheduler* traits have default implementations in them, which makes them hard to extend in Java. We should turn those traits into AbstractClass instead. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3046) Set executor's class loader as the default serializer class loader
[ https://issues.apache.org/jira/browse/SPARK-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097504#comment-14097504 ] Apache Spark commented on SPARK-3046: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1948 Set executor's class loader as the default serializer class loader -- Key: SPARK-3046 URL: https://issues.apache.org/jira/browse/SPARK-3046 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin This is an attempt to fix the problem outlined in SPARK-2878. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3047) Use utf-8 for textFile() by default
Davies Liu created SPARK-3047: - Summary: Use utf-8 for textFile() by default Key: SPARK-3047 URL: https://issues.apache.org/jira/browse/SPARK-3047 Project: Spark Issue Type: Improvement Reporter: Davies Liu In Python 2.x, most strings are bytearrays, which are more efficient than unicode (in both CPU and memory). UTF-8 is the default encoding. After disabling the decode from utf-8 into unicode in UTF8Deserializer, the total time for a wc job is reduced by 32% (from 2m17s to 1m34s). We could also add an argument unicode=False to textFile(). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3011) _temporary directory should be filtered out by sqlContext.parquetFile
[ https://issues.apache.org/jira/browse/SPARK-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097537#comment-14097537 ] Apache Spark commented on SPARK-3011: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/1949 _temporary directory should be filtered out by sqlContext.parquetFile - Key: SPARK-3011 URL: https://issues.apache.org/jira/browse/SPARK-3011 Project: Spark Issue Type: Bug Components: SQL Reporter: Joseph Su Fix For: 1.1.0 Sometimes the _temporary directory is not removed after the file is committed on S3. sqlContext.parquetFile will raise an exception because it is trying to read the metadata in _temporary. sqlContext.parquetFile should just ignore the directory. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect
[ https://issues.apache.org/jira/browse/SPARK-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097564#comment-14097564 ] Apache Spark commented on SPARK-3041: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/1950 DecisionTree: isSampleValid indexing incorrect -- Key: SPARK-3041 URL: https://issues.apache.org/jira/browse/SPARK-3041 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley In DecisionTree, isSampleValid treats unordered categorical features incorrectly: it treats the bins as if they were indexed by feature values, rather than by subsets of values/categories. This bug is exhibited for unordered features (multi-class classification with categorical features of low arity). Proposed fix: Index bins correctly for unordered categorical features. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3022) FindBinsForLevel in decision tree should call findBin only once for each feature
[ https://issues.apache.org/jira/browse/SPARK-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097563#comment-14097563 ] Apache Spark commented on SPARK-3022: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/1950 FindBinsForLevel in decision tree should call findBin only once for each feature Key: SPARK-3022 URL: https://issues.apache.org/jira/browse/SPARK-3022 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Qiping Li Original Estimate: 4h Remaining Estimate: 4h `findbinsForLevel` is applied to every `LabeledPoint` to find bins for all nodes at a given level. Given a specific `LabeledPoint` and a specific feature, the bin to put this labeled point in should always be the same. But in the current implementation, `findBin` on a (labeledpoint, feature) pair is called for every node at a given level, which is a waste of computation. I propose to call `findBin` only once, and if a `LabeledPoint` is valid on a node, this result can be reused. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
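The proposed reuse can be sketched as follows (hypothetical method names, not the actual DecisionTree code): compute the bin for each (point, feature) pair once, then let every node at the level read from that array.
{code}
// One findBin call per (point, feature); all nodes at the level reuse the result.
object BinsOncePerPoint {
  type FindBin = (Array[Double], Int) => Int   // (features, featureIndex) => bin index

  def binsForPoint(features: Array[Double], numFeatures: Int, findBin: FindBin): Array[Int] =
    Array.tabulate(numFeatures)(f => findBin(features, f))

  def main(args: Array[String]): Unit = {
    val toyFindBin: FindBin = (fs, f) => if (fs(f) < 0.5) 0 else 1
    val bins = binsForPoint(features = Array(0.2, 0.9), numFeatures = 2, findBin = toyFindBin)
    println(bins.mkString(","))   // 0,1 -- computed once, reusable by every node at the level
  }
}
{code}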
[jira] [Updated] (SPARK-3047) add an option to use str in textFileRDD()
[ https://issues.apache.org/jira/browse/SPARK-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-3047: -- Summary: add an option to use str in textFileRDD() (was: Use utf-8 for textFile() by default) add an option to use str in textFileRDD() - Key: SPARK-3047 URL: https://issues.apache.org/jira/browse/SPARK-3047 Project: Spark Issue Type: Improvement Reporter: Davies Liu In Python 2.x, most strings are bytearrays, which are more efficient than unicode (in both CPU and memory). UTF-8 is the default encoding. After disabling the decode from utf-8 into unicode in UTF8Deserializer, the total time for a wc job is reduced by 32% (from 2m17s to 1m34s). We could also add an argument unicode=False to textFile(). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3047) add an option to use str in textFileRDD()
[ https://issues.apache.org/jira/browse/SPARK-3047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097639#comment-14097639 ] Apache Spark commented on SPARK-3047: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1951 add an option to use str in textFileRDD() - Key: SPARK-3047 URL: https://issues.apache.org/jira/browse/SPARK-3047 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu In Python 2.x, most strings are bytearrays, which are more efficient than unicode (in both CPU and memory). UTF-8 is the default encoding. After disabling the decode from utf-8 into unicode in UTF8Deserializer, the total time for a wc job is reduced by 32% (from 2m17s to 1m34s). We could also add an argument unicode=False to textFile(). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2736) Create PySpark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2736: - Priority: Major (was: Minor) Create PySpark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2736) Create PySpark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097701#comment-14097701 ] Matei Zaharia commented on SPARK-2736: -- I bumped this up to Major because the PR also contains an important bug fix. Create PySpark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3048) Make LabeledPointParser public
Xiangrui Meng created SPARK-3048: Summary: Make LabeledPointParser public Key: SPARK-3048 URL: https://issues.apache.org/jira/browse/SPARK-3048 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng `LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public may be useful to parse labeled points from other data sources, e.g., DStream. Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, which increases the complexity of the API. I really want to make `parse` as a method of `LabeledPoint$` instead of creating a new class `LabeledPointParser`, but it is hard to extend a case class with a static method. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3049) Make sure client doesn't block when server/connection has error(s)
Reynold Xin created SPARK-3049: -- Summary: Make sure client doesn't block when server/connection has error(s) Key: SPARK-3049 URL: https://issues.apache.org/jira/browse/SPARK-3049 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3048) Make LabeledPointParser public
[ https://issues.apache.org/jira/browse/SPARK-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3048: - Description: `LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public may be useful to parse labeled points from other data sources, e.g., DStream. Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, which increases the complexity of the API. I really want to make `parse` as a method of `LabeledPoint$` instead of creating a new class `LabeledPointParser`, but it is hard to extend a case class with a static method in a binary compatible way. (was: `LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public may be useful to parse labeled points from other data sources, e.g., DStream. Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, which increases the complexity of the API. I really want to make `parse` as a method of `LabeledPoint$` instead of creating a new class `LabeledPointParser`, but it is hard to extend a case class with a static method.) Make LabeledPointParser public -- Key: SPARK-3048 URL: https://issues.apache.org/jira/browse/SPARK-3048 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng `LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public may be useful to parse labeled points from other data sources, e.g., DStream. Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, which increases the complexity of the API. I really want to make `parse` as a method of `LabeledPoint$` instead of creating a new class `LabeledPointParser`, but it is hard to extend a case class with a static method in a binary compatible way. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3050) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster
Mingyu Kim created SPARK-3050: - Summary: Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster Key: SPARK-3050 URL: https://issues.apache.org/jira/browse/SPARK-3050 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: Mingyu Kim Priority: Critical I ran the following code with Spark 1.0.2 jar against a cluster that runs Spark 1.0.1 (i.e. localhost:7077 is running 1.0.1). {code} import java.util.ArrayList; import java.util.List; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; public class TestTest { public static void main(String[] args) { JavaSparkContext sc = new JavaSparkContext(spark://localhost:7077, Test); ListInteger list = new ArrayList(); list.add(1); list.add(2); list.add(3); JavaRDDInteger rdd = sc.parallelize(list); System.out.println(rdd.collect()); } } {code} This throws InvalidClassException. {code} Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception failure in TID 6 on host 10.100.91.90: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = -6766554341038829528, local class serialVersionUID = 385418487991259089 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1835) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:722) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at
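The InvalidClassException above is standard Java serialization behaviour rather than anything Spark-specific (this note is general background, not a statement about Spark's cross-version compatibility policy): a Serializable class without an explicit serialVersionUID gets one computed from its bytecode, so two different builds of the same class can disagree.
{code}
// Pinning the UID keeps serialized streams compatible across builds of the class;
// without it, any structural change produces the "local class incompatible" error.
@SerialVersionUID(1L)
class PinnedMessage(val payload: String) extends Serializable

class UnpinnedMessage(val payload: String) extends Serializable
// If UnpinnedMessage differs between the writing JVM and the reading JVM,
// ObjectInputStream.readObject throws java.io.InvalidClassException with
// mismatched serialVersionUID values, as in the stack trace above.
{code}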
[jira] [Commented] (SPARK-3048) Make LabeledPointParser public
[ https://issues.apache.org/jira/browse/SPARK-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097792#comment-14097792 ] Apache Spark commented on SPARK-3048: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/1952 Make LabeledPointParser public -- Key: SPARK-3048 URL: https://issues.apache.org/jira/browse/SPARK-3048 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng `LabeledPointParser` is used in `MLUtils.loadLabeledPoint`. Making it public may be useful to parse labeled points from other data sources, e.g., DStream. Now we have a util function called `MLUtils.loadStreamingLabeledPoints`, which increases the complexity of the API. I really want to make `parse` as a method of `LabeledPoint$` instead of creating a new class `LabeledPointParser`, but it is hard to extend a case class with a static method in a binary compatible way. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097808#comment-14097808 ] Jim Blomo commented on SPARK-1284: -- Hi, having trouble compiling either master or branch-1.1, I sent a request to the mailing list for help. Are there any compiled snapshots? pyspark hangs after IOError on Executor --- Key: SPARK-1284 URL: https://issues.apache.org/jira/browse/SPARK-1284 Project: Spark Issue Type: Bug Components: PySpark Reporter: Jim Blomo Assignee: Davies Liu When running a reduceByKey over a cached RDD, Python fails with an exception, but the failure is not detected by the task runner. Spark and the pyspark shell hang waiting for the task to finish. The error is: {code} PySpark worker failed with exception: Traceback (most recent call last): File /home/hadoop/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /home/hadoop/spark/python/pyspark/serializers.py, line 182, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /home/hadoop/spark/python/pyspark/serializers.py, line 118, in dump_stream self._write_with_length(obj, stream) File /home/hadoop/spark/python/pyspark/serializers.py, line 130, in _write_with_length stream.write(serialized) IOError: [Errno 104] Connection reset by peer 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 4257 bytes in 47 ms Traceback (most recent call last): File /home/hadoop/spark/python/pyspark/daemon.py, line 117, in launch_worker worker(listen_sock) File /home/hadoop/spark/python/pyspark/daemon.py, line 107, in worker outfile.flush() IOError: [Errno 32] Broken pipe {code} I can reproduce the error by running take(10) on the cached RDD before running reduceByKey (which looks at the whole input file). Affects Version 1.0.0-SNAPSHOT (4d88030486) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097826#comment-14097826 ] Sandy Ryza commented on SPARK-2089: --- H, it's true that my suggestion would require us to serialize and then immediately deserialize a possibly huge string. How about Spark conf properties that just specify the input file and input format, and handles all the logic for converting this to location preferences on the other side. This would also simplify things for the users (just need to set properties, not call any methods). With YARN, preferredNodeLocalityData isn't honored --- Key: SPARK-2089 URL: https://issues.apache.org/jira/browse/SPARK-2089 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers. This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code that . The Spark-Yarn code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers. But in the SparkContext constructor that takes the preferredNodeLocationData, setting preferredNodeLocationData comes after the rest of the initialization, so, if the Spark-YARN code comes around quickly enough after being notified, the data that's fetched is the empty unset version. The occurred during all of my runs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097826#comment-14097826 ] Sandy Ryza edited comment on SPARK-2089 at 8/14/14 10:41 PM: - H, it's true that my suggestion would require us to serialize and then immediately deserialize a possibly huge string. How about Spark conf properties that just specify the input file and input format? We would handle all the logic for converting this to location preferences on the other side. This would also simplify things for the users (just need to set properties, not call any methods). was (Author: sandyr): H, it's true that my suggestion would require us to serialize and then immediately deserialize a possibly huge string. How about Spark conf properties that just specify the input file and input format, and handles all the logic for converting this to location preferences on the other side. This would also simplify things for the users (just need to set properties, not call any methods). With YARN, preferredNodeLocalityData isn't honored --- Key: SPARK-2089 URL: https://issues.apache.org/jira/browse/SPARK-2089 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers. This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code that . The Spark-Yarn code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers. But in the SparkContext constructor that takes the preferredNodeLocationData, setting preferredNodeLocationData comes after the rest of the initialization, so, if the Spark-YARN code comes around quickly enough after being notified, the data that's fetched is the empty unset version. The occurred during all of my runs. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3051) Support looking-up named accumulators in a registry
Neil Ferguson created SPARK-3051: Summary: Support looking-up named accumulators in a registry Key: SPARK-3051 URL: https://issues.apache.org/jira/browse/SPARK-3051 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Neil Ferguson This is a proposed enhancement to Spark based on the following mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html. This proposal builds on SPARK-2380 (Support displaying accumulator values in the web UI) to allow named accumulables to be looked-up in a registry, as opposed to having to be passed to every method that need to access them. The use case was described well by [~shivaram], as follows: Lets say you have two functions you use in a map call and want to measure how much time each of them takes. For example, if you have a code block like the one below and you want to measure how much time f1 takes as a fraction of the task. {noformat} a.map { l = val f = f1(l) ... some work here ... } {noformat} It would be really cool if we could do something like {noformat} a.map { l = val start = System.nanoTime val f = f1(l) TaskMetrics.get(f1-time).add(System.nanoTime - start) } {noformat} SPARK-2380 provides a partial solution to this problem -- however the accumulables would still need to be passed to every function that needs them, which I think would be cumbersome in any application of reasonable complexity. The proposal, as suggested by [~pwendell], is to have a registry of accumulables, that can be looked-up by name. Regarding the implementation details, I'd propose that we broadcast a serialized version of all named accumulables in the DAGScheduler (similar to what SPARK-2521 does for Tasks). These can then be deserialized in the Executor. Accumulables are already stored in thread-local variables in the Accumulators object, so exposing these in the registry should be simply a matter of wrapping this object, and keying the accumulables by name (they are currently keyed by ID). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
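A toy sketch of the registry shape being proposed (this is neither the actual proposal nor a prototype; the names are invented): accumulables registered per task thread and looked up by name, so task code can write something like the `TaskMetrics.get(f1-time)` call above without the accumulable being threaded through every function.
{code}
// Thread-local, name-keyed registry: task code calls get("f1-time") instead of
// receiving the accumulable as an argument at every call site.
object AccumulableRegistry {
  private val registry = new ThreadLocal[Map[String, Any]] {
    override def initialValue(): Map[String, Any] = Map.empty
  }

  def register(name: String, accumulable: Any): Unit =
    registry.set(registry.get + (name -> accumulable))

  def get(name: String): Option[Any] = registry.get.get(name)

  def clear(): Unit = registry.remove()
}
{code}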
[jira] [Commented] (SPARK-3051) Support looking-up named accumulators in a registry
[ https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097893#comment-14097893 ] Neil Ferguson commented on SPARK-3051: -- I've done an initial prototype of this in the following commit: https://github.com/nfergu/spark/commit/fffaa5f9d0e6a0c1b7cfcd82867e733369296e07 It needs more tidy-up, documentation, and testing. Support looking-up named accumulators in a registry --- Key: SPARK-3051 URL: https://issues.apache.org/jira/browse/SPARK-3051 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Neil Ferguson This is a proposed enhancement to Spark based on the following mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html. This proposal builds on SPARK-2380 (Support displaying accumulator values in the web UI) to allow named accumulables to be looked-up in a registry, as opposed to having to be passed to every method that need to access them. The use case was described well by [~shivaram], as follows: Lets say you have two functions you use in a map call and want to measure how much time each of them takes. For example, if you have a code block like the one below and you want to measure how much time f1 takes as a fraction of the task. {noformat} a.map { l = val f = f1(l) ... some work here ... } {noformat} It would be really cool if we could do something like {noformat} a.map { l = val start = System.nanoTime val f = f1(l) TaskMetrics.get(f1-time).add(System.nanoTime - start) } {noformat} SPARK-2380 provides a partial solution to this problem -- however the accumulables would still need to be passed to every function that needs them, which I think would be cumbersome in any application of reasonable complexity. The proposal, as suggested by [~pwendell], is to have a registry of accumulables, that can be looked-up by name. Regarding the implementation details, I'd propose that we broadcast a serialized version of all named accumulables in the DAGScheduler (similar to what SPARK-2521 does for Tasks). These can then be deserialized in the Executor. Accumulables are already stored in thread-local variables in the Accumulators object, so exposing these in the registry should be simply a matter of wrapping this object, and keying the accumulables by name (they are currently keyed by ID). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3052) Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop
Sandy Ryza created SPARK-3052: - Summary: Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop Key: SPARK-3052 URL: https://issues.apache.org/jira/browse/SPARK-3052 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3052) Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop
[ https://issues.apache.org/jira/browse/SPARK-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097948#comment-14097948 ] Apache Spark commented on SPARK-3052: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/1956 Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop --- Key: SPARK-3052 URL: https://issues.apache.org/jira/browse/SPARK-3052 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3050) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster
[ https://issues.apache.org/jira/browse/SPARK-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097947#comment-14097947 ] Patrick Wendell commented on SPARK-3050: Hi [~mkim] - when you launch jobs in client mode (i.e. the driver is running where you launch the job), you will need to use an update version of the spark-submit launch script. You don't need to recompile your job though, you just need to update spark-submit in tandem with your cluster upgrade. Does that work? Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster --- Key: SPARK-3050 URL: https://issues.apache.org/jira/browse/SPARK-3050 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: Mingyu Kim Priority: Critical I ran the following code with Spark 1.0.2 jar against a cluster that runs Spark 1.0.1 (i.e. localhost:7077 is running 1.0.1). {code} import java.util.ArrayList; import java.util.List; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; public class TestTest { public static void main(String[] args) { JavaSparkContext sc = new JavaSparkContext(spark://localhost:7077, Test); ListInteger list = new ArrayList(); list.add(1); list.add(2); list.add(3); JavaRDDInteger rdd = sc.parallelize(list); System.out.println(rdd.collect()); } } {code} This throws InvalidClassException. {code} Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception failure in TID 6 on host 10.100.91.90: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = -6766554341038829528, local class serialVersionUID = 385418487991259089 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1835) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:722) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at
[jira] [Resolved] (SPARK-3050) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster
[ https://issues.apache.org/jira/browse/SPARK-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3050. Resolution: Not a Problem I think the issue here is just needing to use the newer version of the launch script. Feel free to re-open this though if I am misunderstanding. Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster --- Key: SPARK-3050 URL: https://issues.apache.org/jira/browse/SPARK-3050 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: Mingyu Kim Priority: Critical I ran the following code with Spark 1.0.2 jar against a cluster that runs Spark 1.0.1 (i.e. localhost:7077 is running 1.0.1). {code} import java.util.ArrayList; import java.util.List; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; public class TestTest { public static void main(String[] args) { JavaSparkContext sc = new JavaSparkContext(spark://localhost:7077, Test); ListInteger list = new ArrayList(); list.add(1); list.add(2); list.add(3); JavaRDDInteger rdd = sc.parallelize(list); System.out.println(rdd.collect()); } } {code} This throws InvalidClassException. {code} Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception failure in TID 6 on host 10.100.91.90: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = -6766554341038829528, local class serialVersionUID = 385418487991259089 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1835) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:722) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at
[jira] [Created] (SPARK-3053) Reconcile spark.files.userClassPathFirst with spark.yarn.user.classpath.first
Sandy Ryza created SPARK-3053: - Summary: Reconcile spark.files.userClassPathFirst with spark.yarn.user.classpath.first Key: SPARK-3053 URL: https://issues.apache.org/jira/browse/SPARK-3053 Project: Spark Issue Type: Improvement Affects Versions: 1.0.2 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3050) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster
[ https://issues.apache.org/jira/browse/SPARK-3050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3050: --- Priority: Major (was: Critical) Spark program running with 1.0.2 jar cannot run against a 1.0.1 cluster --- Key: SPARK-3050 URL: https://issues.apache.org/jira/browse/SPARK-3050 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: Mingyu Kim I ran the following code with Spark 1.0.2 jar against a cluster that runs Spark 1.0.1 (i.e. localhost:7077 is running 1.0.1). {code} import java.util.ArrayList; import java.util.List; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; public class TestTest { public static void main(String[] args) { JavaSparkContext sc = new JavaSparkContext(spark://localhost:7077, Test); ListInteger list = new ArrayList(); list.add(1); list.add(2); list.add(3); JavaRDDInteger rdd = sc.parallelize(list); System.out.println(rdd.collect()); } } {code} This throws InvalidClassException. {code} Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 4 times, most recent failure: Exception failure in TID 6 on host 10.100.91.90: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = -6766554341038829528, local class serialVersionUID = 385418487991259089 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:604) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1620) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1515) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1769) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1835) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:722) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[jira] [Created] (SPARK-3054) Add tests for SparkSink
Hari Shreedharan created SPARK-3054: --- Summary: Add tests for SparkSink Key: SPARK-3054 URL: https://issues.apache.org/jira/browse/SPARK-3054 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3054) Add tests for SparkSink
[ https://issues.apache.org/jira/browse/SPARK-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097973#comment-14097973 ] Apache Spark commented on SPARK-3054: - User 'harishreedharan' has created a pull request for this issue: https://github.com/apache/spark/pull/1958 Add tests for SparkSink --- Key: SPARK-3054 URL: https://issues.apache.org/jira/browse/SPARK-3054 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Assignee: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2213) Sort Merge Join
[ https://issues.apache.org/jira/browse/SPARK-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097995#comment-14097995 ] Cheng Hao commented on SPARK-2213: -- Sort Merge Join depends on the reduce side sort merge, which was done in SPARK-2926. Sort Merge Join --- Key: SPARK-2213 URL: https://issues.apache.org/jira/browse/SPARK-2213 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
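For readers following along, the dependency makes sense because a sort-merge join assumes both inputs arrive sorted by the join key, which is exactly what the reduce-side sort merge from SPARK-2926 provides; the join itself is then a single linear pass instead of a hash-table build. A minimal, illustrative sketch of that pass on in-memory sequences (not Spark SQL's actual operator):
{code}
import scala.collection.mutable.ArrayBuffer

object SortMergeJoinSketch {
  // Minimal sort-merge join over two sequences already sorted by key.
  def sortMergeJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)])
                            (implicit ord: Ordering[K]): Seq[(K, (A, B))] = {
    val out = ArrayBuffer.empty[(K, (A, B))]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val c = ord.compare(left(i)._1, right(j)._1)
      if (c < 0) i += 1          // left key is smaller: advance left
      else if (c > 0) j += 1     // right key is smaller: advance right
      else {
        // Keys match: emit the cross product of all rows sharing this key.
        val key = left(i)._1
        var iEnd = i
        while (iEnd < left.length && ord.equiv(left(iEnd)._1, key)) iEnd += 1
        var jEnd = j
        while (jEnd < right.length && ord.equiv(right(jEnd)._1, key)) jEnd += 1
        for (x <- i until iEnd; y <- j until jEnd)
          out += ((key, (left(x)._2, right(y)._2)))
        i = iEnd
        j = jEnd
      }
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Both sides must already be sorted by key; only key 2 matches here.
    println(sortMergeJoin(Seq(1 -> "a", 2 -> "b"), Seq(2 -> 10, 3 -> 20)))
  }
}
{code}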
[jira] [Created] (SPARK-3055) Stack trace logged in driver on job failure is usually uninformative
Sandy Ryza created SPARK-3055: - Summary: Stack trace logged in driver on job failure is usually uninformative Key: SPARK-3055 URL: https://issues.apache.org/jira/browse/SPARK-3055 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Sandy Ryza Priority: Minor
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:5 failed 4 times, most recent failure: TID 24 on host hddn04.lsrc.duke.edu failed for unknown reason
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
At a cursory glance, I would expect the stack trace to have something to do with where the task error occurred. In fact, it is where the driver became aware of the error and decided to fail the job. This has been a common point of confusion among our customers. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
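Until the driver-side message improves, the practical workaround is to check the executor logs, or to catch and log the exception inside the task so the real failure site is visible. A small, hedged sketch in local mode (the job and all names are made up for illustration; nothing here is an API proposed by this issue):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object TaskErrorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("task-error-demo"))

    val parsed = sc.parallelize(Seq("1", "2", "oops")).map { s =>
      try {
        s.toInt                      // the task's actual work
      } catch {
        case e: Throwable =>
          // This trace points at the real failure site inside the task,
          // unlike the DAGScheduler "Driver stacktrace" logged by the driver.
          e.printStackTrace()
          throw e
      }
    }

    // The job fails on purpose ("oops" is not an integer); the useful stack
    // trace is the one printed from inside the task above.
    try {
      println(parsed.collect().toSeq)
    } finally {
      sc.stop()
    }
  }
}
{code}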
[jira] [Created] (SPARK-3056) Sort-based Aggregation
Cheng Hao created SPARK-3056: Summary: Sort-based Aggregation Key: SPARK-3056 URL: https://issues.apache.org/jira/browse/SPARK-3056 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Currently, SparkSQL only supports hash-based aggregation, which may cause OOM if there are too many identical keys in the input tuples. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
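To make the trade-off concrete: hash-based aggregation keeps one map entry per group key in memory, while sort-based aggregation first orders the rows by key (in Spark that ordering would come from the external, reduce-side sort) and then streams out one group at a time, holding only the current group's state. A minimal sketch on plain Scala collections, not Spark SQL's operators:
{code}
object AggregationSketch {
  // Hash-based aggregation: the map holds one entry per distinct key, so its
  // size grows with the number of groups kept in memory at once.
  def hashAggregate(rows: Iterator[(String, Long)]): Map[String, Long] =
    rows.foldLeft(Map.empty[String, Long]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }

  // Sort-based aggregation: once rows are ordered by key, groups can be
  // emitted one at a time, keeping only the current key's running sum.
  def sortAggregate(rows: Seq[(String, Long)]): Iterator[(String, Long)] = {
    val sorted = rows.sortBy(_._1)
    new Iterator[(String, Long)] {
      private var i = 0
      def hasNext: Boolean = i < sorted.length
      def next(): (String, Long) = {
        val key = sorted(i)._1
        var sum = 0L
        while (i < sorted.length && sorted(i)._1 == key) {
          sum += sorted(i)._2
          i += 1
        }
        (key, sum)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a" -> 1L, "b" -> 2L, "a" -> 3L)
    println(hashAggregate(rows.iterator))   // one entry per key: a -> 4, b -> 2
    println(sortAggregate(rows).toList)     // List((a,4), (b,2))
  }
}
{code}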