[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544987#comment-14544987 ] Josh Rosen commented on SPARK-7660:

Note that this affects more than just Spark 1.4.0; I'll trace back and figure out the complete list of affected versions tomorrow, but I think that any version that relied on a snappy-java library published after mid-June or July 2014 may be affected.

Snappy-java buffer-sharing bug leads to data corruption / test failures
Key: SPARK-7660
URL: https://issues.apache.org/jira/browse/SPARK-7660
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

snappy-java contains a bug that can lead to situations where separate SnappyOutputStream instances end up sharing the same input and output buffers, which can lead to data corruption issues. See https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue.

I discovered this issue because the buffer sharing was causing a test failure in JavaAPISuite: one of the repartition-and-sort tests was returning the wrong answer because both tasks wrote their output using the same compression buffers and one task won the race, causing its output to be written to both shuffle output files. As a result, the test returned the result of collecting one partition twice. The buffer sharing can only occur if {{close()}} is called twice on the same SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a more precise description of when this issue may occur, see my upstream tickets).

I think that this double-close happens somewhere in test code that was added as part of my Tungsten shuffle patch, exposing this bug (to see this, download a recent build of master and run https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to force the test execution order that triggers the bug). I think it's rare that this bug would lead to silent failures like this. In more realistic workloads that aren't writing only a handful of bytes per task, I would expect this issue to lead to stream corruption issues like SPARK-4105.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
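[Editor's sketch] The double-close hazard described above can be illustrated with a small, self-contained example. This is a hypothetical simplification, not snappy-java's actual code: if `close()` returns a stream's internal buffer to a shared pool, calling `close()` twice donates the same array twice, so two later streams can be handed the same buffer; making `close()` idempotent prevents that.

```java
// Hypothetical sketch (not snappy-java's actual code) of an idempotent
// close() guard that releases a pooled buffer at most once.
import java.util.ArrayDeque;
import java.util.Deque;

public class DoubleCloseDemo {
    static final Deque<byte[]> POOL = new ArrayDeque<>();

    static class GuardedStream {
        private final byte[] buf = new byte[32];
        private boolean closed = false;

        void close() {
            if (closed) return;  // the guard: release the buffer at most once
            closed = true;
            POOL.push(buf);      // without the guard, a second close() would
        }                        // push the same array a second time
    }

    public static void main(String[] args) {
        GuardedStream s = new GuardedStream();
        s.close();
        s.close();               // second close is now a no-op
        if (POOL.size() != 1) throw new AssertionError();
        System.out.println("buffer released exactly once");
    }
}
```

Without the `closed` flag, the second `close()` would push `buf` again and two subsequent streams drawing from `POOL` would share the same array, which mirrors the corruption scenario in the report.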
[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545005#comment-14545005 ] Josh Rosen commented on SPARK-7660:

I pushed https://github.com/apache/spark/commit/7da33ce5057ff965eec19ce662465b64a3564019 as a hotfix, which masks the bug in a way that fixes the JavaAPISuite Jenkins failures. We'll still fix this bug before 1.4, but in the meantime this will make it easy to recognize new Jenkins failures.

Snappy-java buffer-sharing bug leads to data corruption / test failures
Key: SPARK-7660
URL: https://issues.apache.org/jira/browse/SPARK-7660
[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-7660:

Assignee: Josh Rosen

Snappy-java buffer-sharing bug leads to data corruption / test failures
Key: SPARK-7660
URL: https://issues.apache.org/jira/browse/SPARK-7660
[jira] [Commented] (SPARK-7662) Exception of multi-attribute generator analysis in projection
[ https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545072#comment-14545072 ] Apache Spark commented on SPARK-7662:

User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6178

Exception of multi-attribute generator analysis in projection
Key: SPARK-7662
URL: https://issues.apache.org/jira/browse/SPARK-7662
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Cheng Hao
Priority: Blocker

{code}
select explode(map(value, key)) from src;
{code}

It throws an exception like:

{panel}
org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
{panel}
[jira] [Assigned] (SPARK-7662) Exception of multi-attribute generator analysis in projection
[ https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7662:

Assignee: Apache Spark

Exception of multi-attribute generator analysis in projection
Key: SPARK-7662
URL: https://issues.apache.org/jira/browse/SPARK-7662
[jira] [Assigned] (SPARK-7662) Exception of multi-attribute generator analysis in projection
[ https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7662:

Assignee: (was: Apache Spark)

Exception of multi-attribute generator analysis in projection
Key: SPARK-7662
URL: https://issues.apache.org/jira/browse/SPARK-7662
[jira] [Created] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
Josh Rosen created SPARK-7660:

Summary: Snappy-java buffer-sharing bug leads to data corruption / test failures
Key: SPARK-7660
URL: https://issues.apache.org/jira/browse/SPARK-7660
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker
[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545027#comment-14545027 ] Alex commented on SPARK-2344:

Hi, how are you? I have a couple of questions:
1) When are you planning to submit the FCM to the main Spark branch? (I'm interested in working on top of it for feature-weighted FCM improvements.)
2) Is there a way for Spark to distribute an RDD based on input data columns rather than rows?
Thanks, Alex

Add Fuzzy C-Means algorithm to MLlib
Key: SPARK-2344
URL: https://issues.apache.org/jira/browse/SPARK-2344
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Alex
Priority: Minor
Labels: clustering
Original Estimate: 1m
Remaining Estimate: 1m

I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib. FCM is very similar to K-Means, which is already implemented; they differ only in the degree of relationship each point has with each cluster (in FCM the relationship is a degree in the range [0..1], whereas in K-Means it is 0/1). As part of the implementation I would like to:
- create a base class for K-Means and FCM
- implement the relationship for each algorithm differently (in its class)
I'd like this to be assigned to me.
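[Editor's sketch] The [0..1] membership degrees that distinguish FCM from K-Means can be computed from a point's distances to the centroids. This is the standard FCM formulation with fuzzifier m = 2, shown only for illustration; it is not the proposed MLlib implementation: u_j = 1 / Σ_k (d_j / d_k)², so the degrees lie in [0, 1] and sum to 1, in contrast to K-Means' hard 0/1 assignment.

```java
// Illustrative FCM membership computation (standard formulation, m = 2);
// not the proposed MLlib code.
public class FcmMembership {
    // u_j = 1 / sum_k (d_j / d_k)^2 for a point with distances d to centroids
    static double[] memberships(double[] dists) {
        double[] u = new double[dists.length];
        for (int j = 0; j < dists.length; j++) {
            double sum = 0.0;
            for (double dk : dists) {
                double r = dists[j] / dk;
                sum += r * r;
            }
            u[j] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        // distances 1.0 and 3.0: the closer centroid gets the larger degree,
        // and the degrees sum to 1
        double[] u = memberships(new double[] {1.0, 3.0});
        if (Math.abs(u[0] + u[1] - 1.0) > 1e-9) throw new AssertionError();
        if (u[0] <= u[1]) throw new AssertionError();
    }
}
```

Setting every degree to 1 for the nearest centroid and 0 elsewhere recovers K-Means' hard assignment, which is the shared-base-class relationship the reporter describes.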
[jira] [Commented] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545077#comment-14545077 ] Apache Spark commented on SPARK-6747:

User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/6179

Support List as a return type in Hive UDF
Key: SPARK-6747
URL: https://issues.apache.org/jira/browse/SPARK-6747
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Takeshi Yamamuro

The current implementation can't handle List as a return type in a Hive UDF. Assume a UDF like the one below:

{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}

An exception of scala.MatchError is thrown as follows when the UDF is used:

{code}
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...
{code}

To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. However, this has one difficulty because of type erasure in the JVM. Assume that the lines below are appended in HiveInspectors#javaClassToDataType:

{code}
// list type
case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
  val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
  println(tpe.getActualTypeArguments()(0).toString()) // => 'E'
{code}

This logic fails to catch a component type in List.
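[Editor's sketch] The erasure problem described above can be demonstrated directly: at runtime, a `java.util.List`'s element type is gone, and only the formal type variable (named "E") remains on the `Class` object, which is why inspecting the class cannot recover the component type.

```java
// Small demonstration of JVM type erasure: the runtime Class of a
// List<String> carries only the formal type variable "E", not String.
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    public static void main(String[] args) {
        List<String> l = new ArrayList<>();
        // getTypeParameters() yields the declared variables of ArrayList<E>;
        // the actual String argument is erased and unrecoverable from here
        String name = l.getClass().getTypeParameters()[0].getName();
        if (!"E".equals(name)) throw new AssertionError(name);
        System.out.println("element type erased to: " + name);
    }
}
```

Recovering a concrete component type generally requires generic signature information from a declaration site (a field, method return type, or superclass), not from the value's runtime class alone.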
[jira] [Updated] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7660:

Description: snappy-java contains a bug that can lead to situations where separate SnappyOutputStream instances end up sharing the same input and output buffers, which can lead to data corruption issues. See https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue. I discovered this issue because the buffer-sharing was leading to a test failure in JavaAPISuite: one of the repartition-and-sort tests was returning the wrong answer because both tasks wrote their output using the same compression buffers and one task won the race, causing its output to be written to both shuffle output files. As a result, the test returned the result of collecting one partition twice (see https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more details). The buffer-sharing can only occur if {{close()}} is called twice on the same SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a more precise description of when this issue may occur, see my upstream tickets). I think that this double-close happens somewhere in some test code that was added as part of my Tungsten shuffle patch, exposing this bug (to see this, download a recent build of master and run https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to force the test execution order that triggers the bug). I think that it's rare that this bug would lead to silent failures like this. In more realistic workloads that aren't writing only a handful of bytes per task, I would expect this issue to lead to stream corruption issues like SPARK-4105.

Snappy-java buffer-sharing bug leads to data corruption / test failures
Key: SPARK-7660
URL: https://issues.apache.org/jira/browse/SPARK-7660
[jira] [Updated] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6258:

Fix Version/s: 1.4.0

Python MLlib API missing items: Clustering
Key: SPARK-6258
URL: https://issues.apache.org/jira/browse/SPARK-6258
Project: Spark
Issue Type: Sub-task
Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
Fix For: 1.4.0

This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task.

KMeans
* setEpsilon
* setInitializationSteps

KMeansModel
* computeCost
* k

GaussianMixture
* setInitialModel

GaussianMixtureModel
* k

Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA):
* LDA
* PowerIterationClustering
* StreamingKMeans
[jira] [Resolved] (SPARK-7591) FSBasedRelation interface tweaks
[ https://issues.apache.org/jira/browse/SPARK-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-7591:

Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6150 [https://github.com/apache/spark/pull/6150]

FSBasedRelation interface tweaks
Key: SPARK-7591
URL: https://issues.apache.org/jira/browse/SPARK-7591
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Lian
Priority: Blocker
Fix For: 1.4.0

# Rename {{FSBasedRelation}} to {{HadoopFsRelation}}, since it's all coupled with the Hadoop {{FileSystem}} and job API.
# {{HadoopFsRelation}} should have a no-arg constructor: {{paths}} and {{partitionColumns}} should just be methods to be overridden, rather than constructor arguments. This makes data source developers' lives easier by having a no-arg constructor and being serialization friendly.
# Rename {{HadoopFsRelation.prepareForWrite}} to {{HadoopFsRelation.prepareJobForWrite}}. The new name explicitly suggests developers should only touch the {{Job}} instance for preparation work (which is also documented in the Scaladoc).
# Allow serialization while creating {{OutputWriter}}s. To be more precise, {{OutputWriter}}s are never created on the driver side and serialized to the executor side, but the factory that creates {{OutputWriter}}s should be created on the driver side and serialized. The reason behind this is that passing all needed materials to {{OutputWriter}} instances via the Hadoop Configuration is doable but sometimes neither intuitive nor convenient. Resorting to serialization makes data source developers' lives easier. This actually came up when I was migrating the Parquet data source and wanted to pass the final output path (instead of the temporary work path) to the output writer (see [here|https://github.com/liancheng/spark/commit/ec9950c591e5b981ce20fab96562db28488e0035#diff-53521d336f7259e859fea4d3ca4dc888R74]); there I had to put a property into the Configuration object.
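[Editor's sketch] Point 4 above (a serializable factory created on the driver, with writers constructed locally on executors) can be sketched as follows. The names (`WriterFactory`, `newWriter`, `finalPath`) are illustrative, not Spark's actual API; the round-trip through Java serialization stands in for shipping the factory from driver to executor.

```java
// Sketch of the "serializable writer factory" design: the factory crosses
// the wire, the writers themselves never do. Names are illustrative.
import java.io.*;

public class FactoryDemo {
    interface OutputWriter { void write(String row); }

    // Carries driver-side state (e.g. the final output path) to executors.
    static class WriterFactory implements Serializable {
        final String finalPath;
        WriterFactory(String finalPath) { this.finalPath = finalPath; }

        // Called on the executor after deserialization; the returned writer
        // is a local object and never needs to be serializable itself.
        OutputWriter newWriter() {
            return row -> System.out.println(finalPath + " <- " + row);
        }
    }

    // Simulate shipping the factory from driver to executor.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T t) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(t);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        WriterFactory onExecutor = roundTrip(new WriterFactory("/out/final"));
        if (!"/out/final".equals(onExecutor.finalPath)) throw new AssertionError();
        onExecutor.newWriter().write("row1");
    }
}
```

This is exactly the convenience the ticket describes: state like the final output path travels as a plain field on the factory instead of being smuggled through a Hadoop Configuration property.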
[jira] [Commented] (SPARK-7621) Report KafkaReceiver MessageHandler errors so StreamingListeners can take action
[ https://issues.apache.org/jira/browse/SPARK-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545105#comment-14545105 ] Saisai Shao commented on SPARK-7621:

Hi [~jerluc], you could submit a related PR on GitHub; the Spark community submits patches on GitHub rather than attaching them to JIRA.

Report KafkaReceiver MessageHandler errors so StreamingListeners can take action
Key: SPARK-7621
URL: https://issues.apache.org/jira/browse/SPARK-7621
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 1.3.0, 1.3.1
Reporter: Jeremy A. Lucas
Fix For: 1.3.1
Attachments: SPARK-7621.patch
Original Estimate: 24h
Remaining Estimate: 24h

Currently, when a MessageHandler (for any of the Kafka receiver implementations) encounters an error handling a message, the error is only logged with:

{code:none}
case e: Exception => logError("Error handling message", e)
{code}

It would be _incredibly_ useful to be able to notify any registered StreamingListener of this receiver error (especially since this {{try...catch}} block masks more fatal Kafka connection exceptions).
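[Editor's sketch] The requested behavior (forward handler errors to a listener instead of only logging them) can be sketched with hypothetical names; `ReceiverListener` and `MessageHandler` below are illustrative stand-ins, not Spark's actual listener API.

```java
// Sketch: the catch block reports the error to a registered listener
// instead of silently swallowing it after logging. Names are illustrative.
import java.util.concurrent.atomic.AtomicReference;

public class ReportErrorsDemo {
    interface ReceiverListener { void onHandlerError(Throwable e); }

    static class MessageHandler {
        private final ReceiverListener listener;
        MessageHandler(ReceiverListener listener) { this.listener = listener; }

        void handle(Runnable work) {
            try {
                work.run();
            } catch (Exception e) {
                // still log if desired, but also surface the failure so a
                // listener can take action (alerting, restart, etc.)
                listener.onHandlerError(e);
            }
        }
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        MessageHandler handler = new MessageHandler(seen::set);
        handler.handle(() -> { throw new RuntimeException("kafka handler failed"); });
        if (seen.get() == null) throw new AssertionError();
        System.out.println("listener observed: " + seen.get().getMessage());
    }
}
```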
[jira] [Commented] (SPARK-7269) Incorrect aggregation analysis
[ https://issues.apache.org/jira/browse/SPARK-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544979#comment-14544979 ] Apache Spark commented on SPARK-7269: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/6173 Incorrect aggregation analysis -- Key: SPARK-7269 URL: https://issues.apache.org/jira/browse/SPARK-7269 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker In a case insensitive analyzer (HiveContext), the attribute name captial differences will fail the analysis check for aggregation. {code} test(check analysis failed in case in-sensitive) { Seq(1,2,3).map(i = (i, i.toString)).toDF(key, value).registerTempTable(df_analysis) sql(SELECT kEy from df_analysis group by key) } {code} {noformat} expression 'kEy' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get.; org.apache.spark.sql.AnalysisException: expression 'kEy' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() if you don't care which value you get.; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:39) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:85) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:101) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:89) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:39) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1121) at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:97) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply$mcV$sp(SQLQuerySuite.scala:408) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406) at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
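The bug in SPARK-7269 is that the aggregation check compares attribute names literally even when the analyzer is case-insensitive, so `kEy` fails to resolve against a GROUP BY on `key`. The intended behavior can be sketched in a few lines; this is a simplified illustration, not the Catalyst implementation, and `check_aggregation` is a hypothetical name.

```python
# Simplified sketch: a case-insensitive analyzer should normalize attribute
# names before checking that each selected expression appears in the GROUP BY.
def check_aggregation(select_exprs, group_by_exprs, case_sensitive=True):
    norm = (lambda s: s) if case_sensitive else str.lower
    grouped = {norm(g) for g in group_by_exprs}
    for expr in select_exprs:
        if norm(expr) not in grouped:
            # Mirrors the AnalysisException message from the ticket.
            raise ValueError(
                f"expression '{expr}' is neither present in the group by, "
                "nor is it an aggregate function.")
```

With `case_sensitive=False`, `SELECT kEy ... GROUP BY key` passes the check; with literal comparison it raises, which is the failure the ticket reports.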
[jira] [Resolved] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6258. -- Resolution: Fixed Issue resolved by pull request 6087 [https://github.com/apache/spark/pull/6087] Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545017#comment-14545017 ] Apache Spark commented on SPARK-7654: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/6175 DataFrameReader and DataFrameWriter for input/output API Key: SPARK-7654 URL: https://issues.apache.org/jira/browse/SPARK-7654 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We have a proliferation of save options now. It'd make more sense to have a builder pattern for write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming
Murtaza Kanchwala created SPARK-7661: Summary: Support for dynamic allocation of executors in Kinesis Spark Streaming Key: SPARK-7661 URL: https://issues.apache.org/jira/browse/SPARK-7661 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.1 Environment: AWS-EMR Reporter: Murtaza Kanchwala Currently the logic for the number of executors is (N + 1), where N is the number of shards in a Kinesis stream. My requirement is that if I use the Amazon Kinesis resharding util (https://github.com/awslabs/amazon-kinesis-scaling-utils), then there should be some way to allocate executors on the basis of the number of shards directly (for Spark Streaming only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
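The request in SPARK-7661 amounts to recomputing the (N + 1) formula whenever the stream is resharded. A minimal sketch of that arithmetic, using the formula exactly as stated in the ticket (the function name is hypothetical; how the result would be fed into Spark's dynamic allocation is left out):

```python
# Sketch: derive the desired executor count from the current Kinesis shard
# count, using the (N + 1) rule stated in the ticket, so that resharding
# the stream can drive a resize of the streaming job.
def desired_executors(num_shards):
    if num_shards < 1:
        raise ValueError("a Kinesis stream has at least one shard")
    return num_shards + 1  # N shards -> N + 1 executors, per the ticket
```

After a resharding from 4 to 8 shards, the target would move from 5 to 9 executors.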
[jira] [Assigned] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7654: --- Assignee: Apache Spark (was: Reynold Xin) DataFrameReader and DataFrameWriter for input/output API Key: SPARK-7654 URL: https://issues.apache.org/jira/browse/SPARK-7654 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark We have a proliferation of save options now. It'd make more sense to have a builder pattern for write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7654: --- Assignee: Reynold Xin (was: Apache Spark) DataFrameReader and DataFrameWriter for input/output API Key: SPARK-7654 URL: https://issues.apache.org/jira/browse/SPARK-7654 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We have a proliferation of save options now. It'd make more sense to have a builder pattern for write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7651: --- Assignee: Apache Spark PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7651: --- Assignee: (was: Apache Spark) PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input
[ https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545126#comment-14545126 ] Apache Spark commented on SPARK-7651: - User 'FlytxtRnD' has created a pull request for this issue: https://github.com/apache/spark/pull/6180 PySpark GMM predict, predictSoft should fail on bad input - Key: SPARK-7651 URL: https://issues.apache.org/jira/browse/SPARK-7651 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor In PySpark, GaussianMixtureModel predict and predictSoft test if the argument is an RDD and operate correctly if so. But if the argument is not an RDD, they fail silently, returning nothing. [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176] Instead, they should raise errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
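The fix SPARK-7651 asks for is straightforward: validate the argument type and raise instead of silently returning nothing. A minimal sketch, with a plain marker class standing in for `pyspark.rdd.RDD` and a placeholder model (the real `predict` would score each point against the fitted Gaussians):

```python
# Sketch of the proposed fix: GaussianMixtureModel.predict should raise a
# TypeError on non-RDD input rather than fall through and return None.
class RDD:
    """Stand-in for pyspark.rdd.RDD, for illustration only."""
    def __init__(self, data):
        self.data = data

class GaussianMixtureModelSketch:
    def predict(self, x):
        if not isinstance(x, RDD):
            raise TypeError(
                "x should be an RDD, got %s" % type(x).__name__)
        # Placeholder: a real model assigns each point its most likely cluster.
        return [0 for _ in x.data]
```

Calling `predict` with a plain list now fails loudly at the call site, instead of the silent `None` the ticket describes.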
[jira] [Assigned] (SPARK-7586) User guide update for spark.ml Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7586: --- Assignee: Apache Spark (was: Xusen Yin) User guide update for spark.ml Word2Vec --- Key: SPARK-7586 URL: https://issues.apache.org/jira/browse/SPARK-7586 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Assignee: Apache Spark Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7586) User guide update for spark.ml Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7586: --- Assignee: Xusen Yin (was: Apache Spark) User guide update for spark.ml Word2Vec --- Key: SPARK-7586 URL: https://issues.apache.org/jira/browse/SPARK-7586 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Assignee: Xusen Yin Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7586) User guide update for spark.ml Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545153#comment-14545153 ] Apache Spark commented on SPARK-7586: - User 'yinxusen' has created a pull request for this issue: https://github.com/apache/spark/pull/6181 User guide update for spark.ml Word2Vec --- Key: SPARK-7586 URL: https://issues.apache.org/jira/browse/SPARK-7586 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Assignee: Xusen Yin Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero
[ https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7663: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Yeah it should be an error in any event. It's just a question of whether you want a different error. You could `require` a non-empty iterator with an appropriate error message instead. [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero - Key: SPARK-7663 URL: https://issues.apache.org/jira/browse/SPARK-7663 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xusen Yin Priority: Minor Fix For: 1.4.1 mllib.feature.Word2Vec at line 442: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442 uses `.head` to get the vector size. But it would throw an empty iterator error if `minCount` is large enough to remove all words in the dataset. Since this is not a common scenario, maybe we can ignore it. If so, we can close the issue directly. If not, I can add some code to print a more helpful error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
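The `require`-style guard Sean Owen suggests can be sketched as follows: check the vocabulary after `minCount` filtering and fail with a descriptive message, instead of letting a later `.head` blow up on an empty iterator. The function name is illustrative, not Word2Vec's actual internals.

```python
# Sketch of the suggested guard: build the vocabulary, then fail with a
# clear message if minCount filtered out every word, rather than hitting
# an opaque empty-iterator error downstream.
from collections import Counter

def learn_vocab(words, min_count):
    vocab = {w: c for w, c in Counter(words).items() if c >= min_count}
    if not vocab:
        raise ValueError(
            "vocabulary is empty: every word occurs fewer than "
            f"minCount={min_count} times; lower minCount or supply more data")
    return vocab
```

The user still gets an error either way, which is Sean's point; the difference is that this one names the actual cause (`minCount` too high for the dataset).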
[jira] [Commented] (SPARK-7566) HiveContext.analyzer cannot be overridden
[ https://issues.apache.org/jira/browse/SPARK-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545066#comment-14545066 ] Apache Spark commented on SPARK-7566: - User 'smola' has created a pull request for this issue: https://github.com/apache/spark/pull/6177 HiveContext.analyzer cannot be overridden Key: SPARK-7566 URL: https://issues.apache.org/jira/browse/SPARK-7566 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Santiago M. Mola Assignee: Santiago M. Mola Fix For: 1.4.0 Trying to override HiveContext.analyzer will give the following compilation error: {code} Error:(51, 36) overriding lazy value analyzer in class HiveContext of type org.apache.spark.sql.catalyst.analysis.Analyzer{val extendedResolutionRules: List[org.apache.spark.sql.catalyst.rules.Rule[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]]}; lazy value analyzer has incompatible type override protected[sql] lazy val analyzer: Analyzer = { ^ {code} That is because the type was narrowed inadvertently when the return-type declaration was omitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7662) Exception of multi-attribute generator anlysis in projection
Cheng Hao created SPARK-7662: Summary: Exception of multi-attribute generator anlysis in projection Key: SPARK-7662 URL: https://issues.apache.org/jira/browse/SPARK-7662 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker {code} select explode(map(value, key)) from src; {code} It throws exception like {panel} org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, 
e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero
[ https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7663: - Fix Version/s: (was: 1.4.1) (Don't set Fix Version please: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero - Key: SPARK-7663 URL: https://issues.apache.org/jira/browse/SPARK-7663 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.4.0 Reporter: Xusen Yin Priority: Minor mllib.feature.Word2Vec at line 442: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442 uses `.head` to get the vector size. But it would throw an empty iterator error if `minCount` is large enough to remove all words in the dataset. Since this is not a common scenario, maybe we can ignore it. If so, we can close the issue directly. If not, I can add some code to print a more helpful error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545006#comment-14545006 ] Josh Rosen commented on SPARK-4105: --- I've opened SPARK-7660 to track progress on the fix for the snappy-java buffer sharing bug. FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Attachments: JavaObjectToSerialize.java, SparkFailedToUncompressGenerator.scala We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at 
org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code}
[jira] [Updated] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7654: --- Summary: DataFrameReader and DataFrameWriter for input/output API (was: Create builder pattern for DataFrame.save) DataFrameReader and DataFrameWriter for input/output API Key: SPARK-7654 URL: https://issues.apache.org/jira/browse/SPARK-7654 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We have a proliferation of save options now. It'd make more sense to have a builder pattern instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API
[ https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7654: --- Description: We have a proliferation of save options now. It'd make more sense to have a builder pattern for write. was: We have a proliferation of save options now. It'd make more sense to have a builder pattern instead. DataFrameReader and DataFrameWriter for input/output API Key: SPARK-7654 URL: https://issues.apache.org/jira/browse/SPARK-7654 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We have a proliferation of save options now. It'd make more sense to have a builder pattern for write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
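The builder pattern proposed in SPARK-7654 replaces a proliferation of `save(...)` overloads with a fluent object that accumulates the format and options, then acts once `save()` is called. A toy Python sketch of that shape (not the real `DataFrameWriter` API; the class name and the dict returned by `save` are illustrative):

```python
# Toy sketch of a builder-pattern writer: format() and option() accumulate
# state and return self for chaining; save() consumes the accumulated state.
class DataFrameWriterSketch:
    def __init__(self, df):
        self._df = df
        self._format = "parquet"  # a default source
        self._options = {}

    def format(self, source):
        self._format = source
        return self  # returning self is what makes the chain fluent

    def option(self, key, value):
        self._options[key] = value
        return self

    def save(self, path):
        # A real implementation would dispatch on self._format and write
        # self._df; here we just expose the accumulated write plan.
        return {"path": path, "format": self._format,
                "options": dict(self._options)}
```

The payoff is the call-site shape: `writer.format("json").option("compression", "gzip").save(path)` instead of one positional-argument overload per combination of options.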
[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7660: --- Assignee: Apache Spark Snappy-java buffer-sharing bug leads to data corruption / test failures --- Key: SPARK-7660 URL: https://issues.apache.org/jira/browse/SPARK-7660 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.4.0 Reporter: Josh Rosen Assignee: Apache Spark Priority: Blocker snappy-java contains a bug that can lead to situations where separate SnappyOutputStream instances end up sharing the same input and output buffers, which can lead to data corruption issues. See https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue. I discovered this issue because the buffer-sharing was leading to a test failure in JavaAPISuite: one of the repartition-and-sort tests was returning the wrong answer because both tasks wrote their output using the same compression buffers and one task won the race, causing its output to be written to both shuffle output files. As a result, the test returned the result of collecting one partition twice (see https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more details). The buffer-sharing can only occur if {{close()}} is called twice on the same SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a more precise description of when this issue may occur, see my upstream tickets). I think that this double-close happens somewhere in some test code that was added as part of my Tungsten shuffle patch, exposing this bug (to see this, download a recent build of master and run https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to force the test execution order that triggers the bug). I think that it's rare that this bug would lead to silent failures like this. 
In more realistic workloads that aren't writing only a handful of bytes per task, I would expect this issue to lead to stream corruption issues like SPARK-4105. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
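The essence of the upstream fix (xerial/snappy-java#108) is to make `close()` idempotent, so a second call cannot return the stream's buffers to a shared pool twice and hand the same buffer to two live streams. A simplified Python sketch of that guard; the pool and buffer handling are stand-ins for snappy-java's internals, not its actual code:

```python
# Simplified sketch of the double-close guard: the first close() recycles
# the buffer into the shared pool; any later close() is a no-op, so the
# buffer can never be handed out to two streams at once.
class PooledOutputStreamSketch:
    def __init__(self, pool):
        self._pool = pool                     # shared buffer pool
        self._buffer = bytearray(32 * 1024)   # this stream's working buffer
        self._closed = False

    def close(self):
        if self._closed:
            return  # idempotent: do not recycle the buffer a second time
        self._closed = True
        self._pool.append(self._buffer)
        self._buffer = None
```

Without the `_closed` flag, calling `close()` twice would enqueue the same buffer twice, and two subsequently opened streams could both draw it from the pool, which is exactly the sharing scenario the ticket describes.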
[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545031#comment-14545031 ] Apache Spark commented on SPARK-7660: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/6176

Snappy-java buffer-sharing bug leads to data corruption / test failures
---
Key: SPARK-7660
URL: https://issues.apache.org/jira/browse/SPARK-7660
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

snappy-java contains a bug that can lead to situations where separate SnappyOutputStream instances end up sharing the same input and output buffers, which can lead to data corruption. See https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue.

I discovered this issue because the buffer-sharing was leading to a test failure in JavaAPISuite: one of the repartition-and-sort tests was returning the wrong answer because both tasks wrote their output using the same compression buffers and one task won the race, causing its output to be written to both shuffle output files. As a result, the test returned the result of collecting one partition twice (see https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more details).

The buffer-sharing can only occur if {{close()}} is called twice on the same SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a more precise description of when this issue may occur, see my upstream tickets). I think this double-close happens somewhere in test code that was added as part of my Tungsten shuffle patch, exposing this bug (to see this, download a recent build of master and run https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally to force the test execution order that triggers the bug).

I think it's rare that this bug would lead to silent failures like this. In more realistic workloads that aren't writing only a handful of bytes per task, I would expect this issue to lead to stream corruption issues like SPARK-4105.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7660: --- Assignee: (was: Apache Spark)

Snappy-java buffer-sharing bug leads to data corruption / test failures
---
Key: SPARK-7660
URL: https://issues.apache.org/jira/browse/SPARK-7660
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker
[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures
[ https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545014#comment-14545014 ] Josh Rosen commented on SPARK-7660: --- If we're wary of upgrading to a new Snappy version and don't want to wait for a new release / backport, one option is to wrap SnappyOutputStream with our own code that makes close() idempotent. I don't think this will add any significant overhead if done right, since the JIT should be able to inline the SnappyOutputStream calls.

Snappy-java buffer-sharing bug leads to data corruption / test failures
---
Key: SPARK-7660
URL: https://issues.apache.org/jira/browse/SPARK-7660
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker
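The idempotent-close wrapper suggested in the comment above can be sketched as a small delegating stream. This is a minimal illustration, not Spark's actual fix: the class name is hypothetical, and a plain `OutputStream` stands in for `SnappyOutputStream` so the sketch is self-contained.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Minimal sketch of the proposed workaround: delegate everything to the
// underlying stream, but make close() idempotent so a double-close can
// never hand the wrapped stream's buffers back to a recycling pool twice.
// The class name is hypothetical; in Spark the wrapped stream would be a
// SnappyOutputStream.
class IdempotentCloseOutputStream extends FilterOutputStream {
    private boolean closed = false;

    IdempotentCloseOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Delegate directly; FilterOutputStream's default writes one byte at a time.
        out.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return; // second and later calls are no-ops
        }
        closed = true;
        super.close(); // flushes, then closes the underlying stream exactly once
    }
}
```

Wrapping the compression stream at creation time means no call site has to be audited for double-close, which is the appeal of this approach over hunting down the offending test code.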
[jira] [Created] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero
Xusen Yin created SPARK-7663:
Summary: [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero
Key: SPARK-7663
URL: https://issues.apache.org/jira/browse/SPARK-7663
Project: Spark
Issue Type: Bug
Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
Fix For: 1.4.1

mllib.feature.Word2Vec at line 442 (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442) uses `.head` to get the vector size, but this throws an empty-iterator error if `minCount` is large enough to remove all words in the dataset. Because this is not a common scenario, maybe we can ignore it; if so, we can close the issue directly. If not, I can add some code to print a more elegant error hint.
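The "more elegant error hint" proposed above amounts to checking for an empty vocabulary before taking the first element. A hedged sketch (in Java for illustration; the class and method names are hypothetical and only mirror the idea, not the Scala code in Word2Vec.scala):

```java
import java.util.Iterator;

// Hypothetical sketch of the proposed error hint: instead of letting an
// empty-iterator error surface from an internal head-style call, fail early
// with a message that names the likely cause (minCount filtered out every
// word in the dataset).
class VocabCheck {
    static <T> T firstOrFail(Iterator<T> vocab, int minCount) {
        if (!vocab.hasNext()) {
            throw new IllegalArgumentException(
                "The vocabulary is empty; minCount=" + minCount
                + " may be filtering out every word in the dataset.");
        }
        return vocab.next();
    }
}
```

The point of the guard is that the user sees which parameter to change, rather than a bare empty-iterator stack trace from deep inside the trainer.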
[jira] [Assigned] (SPARK-7227) Support fillna / dropna in R DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7227: --- Assignee: Apache Spark (was: Sun Rui) Support fillna / dropna in R DataFrame -- Key: SPARK-7227 URL: https://issues.apache.org/jira/browse/SPARK-7227 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Reynold Xin Assignee: Apache Spark Priority: Critical
[jira] [Assigned] (SPARK-7227) Support fillna / dropna in R DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7227: --- Assignee: Sun Rui (was: Apache Spark) Support fillna / dropna in R DataFrame -- Key: SPARK-7227 URL: https://issues.apache.org/jira/browse/SPARK-7227 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Reynold Xin Assignee: Sun Rui Priority: Critical
[jira] [Commented] (SPARK-7227) Support fillna / dropna in R DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545300#comment-14545300 ] Apache Spark commented on SPARK-7227: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/6183 Support fillna / dropna in R DataFrame -- Key: SPARK-7227 URL: https://issues.apache.org/jira/browse/SPARK-7227 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Reynold Xin Assignee: Sun Rui Priority: Critical
[jira] [Updated] (SPARK-7657) [YARN] Show driver link in Spark UI
[ https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7657: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) [YARN] Show driver link in Spark UI --- Key: SPARK-7657 URL: https://issues.apache.org/jira/browse/SPARK-7657 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.4.0 Reporter: Hari Shreedharan Priority: Minor Currently, the driver link does not show up in the application UI. It is painful to debug apps running in cluster mode if the link does not show up. Client mode is fine since the links are local to the client machine. In YARN mode, it is possible to just get this from the YARN container report.
[jira] [Commented] (SPARK-6499) pyspark: printSchema command on a dataframe hangs
[ https://issues.apache.org/jira/browse/SPARK-6499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545355#comment-14545355 ] Sean Owen commented on SPARK-6499: -- I can't reproduce this. Are you sure it still happens? What version, master? pyspark: printSchema command on a dataframe hangs - Key: SPARK-6499 URL: https://issues.apache.org/jira/browse/SPARK-6499 Project: Spark Issue Type: Bug Components: PySpark Reporter: cynepia Attachments: airports.json, pyspark.txt A printSchema() on a dataframe fails to respond even after a lot of time. Will attach the console logs.
[jira] [Updated] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions
[ https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6399: - Issue Type: Improvement (was: Bug) Code compiled against 1.3.0 may not run against older Spark versions Key: SPARK-6399 URL: https://issues.apache.org/jira/browse/SPARK-6399 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're easier to use. The problem is that scalac now generates code that will not run on older Spark versions if those conversions are used. Basically, even if you explicitly import {{SparkContext._}}, scalac will generate references to the new methods in the {{RDD}} object instead. So the compiled code will reference code that doesn't exist in older versions of Spark. You can work around this by explicitly calling the methods in the {{SparkContext}} object, although that's a little ugly. We should at least document this limitation (if there's no way to fix it), since I believe forwards compatibility in the API was also a goal.
[jira] [Updated] (SPARK-6287) Add support for dynamic allocation in the Mesos coarse-grained scheduler
[ https://issues.apache.org/jira/browse/SPARK-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6287: - Issue Type: Improvement (was: Bug) Add support for dynamic allocation in the Mesos coarse-grained scheduler Key: SPARK-6287 URL: https://issues.apache.org/jira/browse/SPARK-6287 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Iulian Dragos Add support inside the coarse-grained Mesos scheduler for dynamic allocation. It amounts to implementing two methods that allow scaling up and down the number of executors:
{code}
def doKillExecutors(executorIds: Seq[String])
def doRequestTotalExecutors(requestedTotal: Int)
{code}
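The two hooks above amount to bookkeeping over a target executor count. A hedged sketch of that bookkeeping, in Java with hypothetical names (not Spark's scheduler-backend API, just the shape of the two operations):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the state behind the two scheduler hooks:
// doRequestTotalExecutors records the desired total, and doKillExecutors
// removes specific executors while lowering the target accordingly, so the
// scheduler does not immediately relaunch what it just killed.
class ExecutorAllocator {
    private final Set<String> running = new HashSet<>();
    private int requestedTotal = 0;

    boolean doRequestTotalExecutors(int requestedTotal) {
        this.requestedTotal = requestedTotal;
        return true;
    }

    boolean doKillExecutors(List<String> executorIds) {
        boolean allKnown = true;
        for (String id : executorIds) {
            allKnown &= running.remove(id);
        }
        requestedTotal = Math.max(0, requestedTotal - executorIds.size());
        return allKnown;
    }

    void register(String executorId) { running.add(executorId); }

    // How many more executors the backend should still launch.
    int pendingToLaunch() { return Math.max(0, requestedTotal - running.size()); }
}
```

Lowering the target inside the kill path is the design point: dynamic allocation kills idle executors precisely because it wants fewer of them.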
[jira] [Updated] (SPARK-7336) Sometimes the status of finished job show on JobHistory UI will be active, and never update.
[ https://issues.apache.org/jira/browse/SPARK-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7336: - Priority: Minor (was: Major) Sometimes the status of finished job show on JobHistory UI will be active, and never update. Key: SPARK-7336 URL: https://issues.apache.org/jira/browse/SPARK-7336 Project: Spark Issue Type: Bug Components: Web UI Reporter: ShaoChuan Priority: Minor When I run a SparkPi job, the status of the job on the JobHistory UI was 'active'. Long after the job finished, the status on the JobHistory UI never updated, and the job stayed in the 'Incomplete applications' list. This problem appears occasionally, and the JobHistory configuration is the default.
[jira] [Resolved] (SPARK-6520) Kyro serialization broken in the shell
[ https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6520. -- Resolution: Won't Fix

Yes, I think this is a function of how {{:paste}}d code is evaluated and how that interacts with what Kryo expects. I don't know that it's realistic to expect that changes; spark-shell is just quite different in how classes are defined on the fly. You can run a compiled program, and you can separately paste your class definitions first if you have to.

Kyro serialization broken in the shell
--
Key: SPARK-6520
URL: https://issues.apache.org/jira/browse/SPARK-6520
Project: Spark
Issue Type: Bug
Components: Spark Shell
Affects Versions: 1.3.0
Reporter: Aaron Defazio

If I start spark as follows:
{quote}
~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
{quote}
Then using :paste, run
{quote}
case class Example(foo : String, bar : String)
val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect()
{quote}
I get the error:
{quote}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.IOException: com.esotericsoftware.kryo.KryoException: Error constructing instance of class: $line3.$read
Serialization trace:
$VAL10 ($iwC)
$outer ($iwC$$iwC)
$outer ($iwC$$iwC$Example)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
{quote}
As far as I can tell, when using :paste, Kryo serialization doesn't work for classes defined within the same paste. It does work when the statements are entered without paste. This issue seems serious to me, since Kryo serialization is virtually mandatory for performance (20x slower with default serialization on my problem), and I'm assuming feature parity between spark-shell and spark-submit is a goal. Note that this is different from SPARK-6497, which covers the case when Kryo is set to require registration.
[jira] [Resolved] (SPARK-5711) Sort Shuffle performance issues about using AppendOnlyMap for large data sets
[ https://issues.apache.org/jira/browse/SPARK-5711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5711. -- Resolution: Not A Problem

I'm not sure this qualifies as a bug. You're just saying that processing a lot of data took a long time, and that time was spent somewhere. If you have a specific suggestion about how to set the size of this map more intelligently to avoid growing/rehashing, we can reopen.

Sort Shuffle performance issues about using AppendOnlyMap for large data sets
-
Key: SPARK-5711
URL: https://issues.apache.org/jira/browse/SPARK-5711
Project: Spark
Issue Type: Bug
Components: Shuffle
Affects Versions: 1.2.0
Environment: hbase-0.98.6-cdh5.2.0 phoenix-4.2.2
Reporter: Sun Fulin

Recently we caught performance issues when using Spark 1.2.0 to read data from HBase and do some summary work. Our scenario: read large data sets from HBase (maybe a 100G+ file), form an hbaseRDD, transform it to a SchemaRDD, then group by and aggregate the data into fewer new summary data sets, and load the data into HBase (Phoenix). Our major issue: aggregating the large datasets into summary data sets consumed too much time (1 hour+), which should not be such bad performance. We got the dump file attached and a stacktrace from jstack like the following. From the stacktrace and dump file we can identify that processing large datasets causes frequent AppendOnlyMap growing, leading to a huge map entry size. We referenced the source code of org.apache.spark.util.collection.AppendOnlyMap and found that the map is initialized with a capacity of 64. That would be too small for our use case.
Thread 22432: (state = IN_JAVA)
- org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame)
- org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame)
- org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame)
- org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame)
- org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame)
- org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
- org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)

Thread 22431: (state = IN_JAVA)
- org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame)
- org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame)
- org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame)
- org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame)
- org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame)
- org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
- org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame) -
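The grow/rehash cost the stack traces above point at is the general hash-map resizing pattern, and the suggestion in the report amounts to giving the map a capacity hint. A minimal sketch of the idea, using java.util.HashMap as a stand-in for Spark's AppendOnlyMap (the helper name is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reporter's suggestion, with java.util.HashMap standing in
// for Spark's AppendOnlyMap: a map pre-sized for the expected number of
// distinct keys never has to grow and rehash, while a map that starts at a
// small default capacity (AppendOnlyMap starts at 64) must repeatedly
// double and rehash as a large dataset is aggregated.
class PreSizedMap {
    static Map<String, Long> countWithCapacityHint(Iterable<String> keys, int expectedSize) {
        // 0.75f is HashMap's default load factor; dividing by it yields a
        // capacity that can hold expectedSize entries without resizing.
        Map<String, Long> counts = new HashMap<>((int) (expectedSize / 0.75f) + 1);
        for (String k : keys) {
            counts.merge(k, 1L, Long::sum);
        }
        return counts;
    }
}
```

The hard part, as Sean Owen's resolution notes, is that a shuffle map has no reliable way to know `expectedSize` up front, which is why "set the size more intelligently" needs a concrete proposal before the issue can be reopened.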
[jira] [Commented] (SPARK-5412) Cannot bind Master to a specific hostname as per the documentation
[ https://issues.apache.org/jira/browse/SPARK-5412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545371#comment-14545371 ] Sean Owen commented on SPARK-5412: -- A-ha. I think the issue is that additional args to {{start-master.sh}} aren't passed through to {{Master}} with $@. I think they are intended to be, as the same thing is done in {{start-slave.sh}}, for example. Let me look a little more and open a PR if it seems like the right thing to do.

Cannot bind Master to a specific hostname as per the documentation
--
Key: SPARK-5412
URL: https://issues.apache.org/jira/browse/SPARK-5412
Project: Spark
Issue Type: Bug
Components: Deploy
Affects Versions: 1.2.0
Reporter: Alexis Seigneurin

Documentation on http://spark.apache.org/docs/latest/spark-standalone.html indicates:
{quote}
You can start a standalone master server by executing: ./sbin/start-master.sh
... the following configuration options can be passed to the master and worker:
... -h HOST, --host HOST Hostname to listen on
{quote}
The \-h or --host parameter actually doesn't work with the start-master.sh script. Instead, one has to set the SPARK_MASTER_IP variable prior to executing the script. Either the script or the documentation should be updated.
[jira] [Updated] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.
[ https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-7664: -- Summary: DAG visualization: Fix incorrect link paths of DAG. (was: Fix incorrect link paths of DAG.) DAG visualization: Fix incorrect link paths of DAG. --- Key: SPARK-7664 URL: https://issues.apache.org/jira/browse/SPARK-7664 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Priority: Minor In JobPage, we can jump to a StagePage when we click the corresponding box in the DAG viz, but the link path is incorrect.
[jira] [Updated] (SPARK-7631) treenode argString should not print children
[ https://issues.apache.org/jira/browse/SPARK-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7631: - Priority: Minor (was: Major)

treenode argString should not print children
Key: SPARK-7631
URL: https://issues.apache.org/jira/browse/SPARK-7631
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang
Priority: Minor

spark-sql explain extended select * from ( select key from src union all select key from src) t;
The spark plan will print children in argString:
== Physical Plan ==
Union[ HiveTableScan [key#1], (MetastoreRelation default, src, None), None, HiveTableScan [key#3], (MetastoreRelation default, src, None), None]
HiveTableScan [key#1], (MetastoreRelation default, src, None), None
HiveTableScan [key#3], (MetastoreRelation default, src, None), None
[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-7536: --- Description: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. SPARK-7666 * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python. ** classification *** StreamingLogisticRegressionWithSGD SPARK-7633 ** clustering *** GaussianMixture SPARK-6258 *** LDA SPARK-6259 *** Power Iteration Clustering SPARK-5962 *** StreamingKMeans SPARK-4118 ** evaluation *** MultilabelMetrics SPARK-6094 ** feature *** ElementwiseProduct SPARK-7605 *** PCA SPARK-7604 ** linalg *** Distributed linear algebra SPARK-6100 ** pmml.export SPARK-7638 ** regression *** StreamingLinearRegressionWithSGD SPARK-4127 ** stat *** KernelDensity SPARK-7639 ** util *** MLUtils SPARK-6263 was: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. 
SPARK-7665 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python. ** classification *** StreamingLogisticRegressionWithSGD SPARK-7633 ** clustering *** GaussianMixture SPARK-6258 *** LDA SPARK-6259 *** Power Iteration Clustering SPARK-5962 *** StreamingKMeans SPARK-4118 ** evaluation *** MultilabelMetrics SPARK-6094 ** feature *** ElementwiseProduct SPARK-7605 *** PCA SPARK-7604 ** linalg *** Distributed linear algebra SPARK-6100 ** pmml.export SPARK-7638 ** regression *** StreamingLinearRegressionWithSGD SPARK-4127 ** stat *** KernelDensity SPARK-7639 ** util *** MLUtils SPARK-6263

Audit MLlib Python API for 1.4
--
Key: SPARK-7536
URL: https://issues.apache.org/jira/browse/SPARK-7536
Project: Spark
Issue Type: Sub-task
Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
[jira] [Created] (SPARK-7666) MLlib Python doc parity check
Yanbo Liang created SPARK-7666: -- Summary: MLlib Python doc parity check Key: SPARK-7666 URL: https://issues.apache.org/jira/browse/SPARK-7666 Project: Spark Issue Type: Documentation Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Check the MLlib Python doc and make it as complete as the Scala doc.
[jira] [Updated] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions
[ https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6399: - Priority: Minor (was: Major) Code compiled against 1.3.0 may not run against older Spark versions Key: SPARK-6399 URL: https://issues.apache.org/jira/browse/SPARK-6399 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Priority: Minor Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're easier to use. The problem is that scalac now generates code that will not run on older Spark versions if those conversions are used. Basically, even if you explicitly import {{SparkContext._}}, scalac will generate references to the new methods in the {{RDD}} object instead. So the compiled code will reference code that doesn't exist in older versions of Spark. You can work around this by explicitly calling the methods in the {{SparkContext}} object, although that's a little ugly. We should at least document this limitation (if there's no way to fix it), since I believe forwards compatibility in the API was also a goal.
[jira] [Updated] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file
[ https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6355: - Priority: Minor (was: Major) Spark standalone cluster does not support local:/ url for jar file -- Key: SPARK-6355 URL: https://issues.apache.org/jira/browse/SPARK-6355 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Jesper Lundgren Priority: Minor Submitting a new spark application to a standalone cluster with local:/path will result in an exception. Driver successfully submitted as driver-20150316171157-0004 ... waiting before polling master for driver state ... polling master for driver state State of driver-20150316171157-0004 is ERROR Exception from cluster was: java.io.IOException: No FileSystem for scheme: local java.io.IOException: No FileSystem for scheme: local at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app
[ https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545356#comment-14545356 ] Sean Owen commented on SPARK-6415: -- Sort of related to https://issues.apache.org/jira/browse/SPARK-4545 You don't really want the whole streaming system to stop if one batch fails though, right? I can see wanting to stop it if every batch will fail, though that's harder to know. Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app - Key: SPARK-6415 URL: https://issues.apache.org/jira/browse/SPARK-6415 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Of course, this would have to be done as a configurable param, but such a fail-fast is useful; otherwise it is painful to figure out what is happening when there are cascading failures. In some cases, the SparkContext shuts down and streaming keeps scheduling jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
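The configurable fail-fast Hari asks for, balanced against Sean's point that one failed batch should not stop the whole system, could be modeled as a consecutive-failure threshold. The sketch below is a hypothetical Python illustration; the class and parameter names are invented, not Spark's API:

```python
class FailFastPolicy:
    """Hypothetical fail-fast policy for a streaming scheduler:
    stop scheduling new batches after N consecutive batch failures,
    so a single failed batch does not kill the app, but sustained
    cascading failures do."""

    def __init__(self, max_consecutive_failures):
        self.max_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.stopped = False

    def on_batch_completed(self, succeeded):
        """Record a batch outcome; return True if scheduling should continue."""
        if succeeded:
            self.consecutive_failures = 0  # a success resets the counter
        elif not self.stopped:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_failures:
                self.stopped = True  # fail fast: stop scheduling further batches
        return not self.stopped

policy = FailFastPolicy(max_consecutive_failures=2)
policy.on_batch_completed(False)               # first failure: keep going
keep_going = policy.on_batch_completed(False)  # second in a row: stop
```

A threshold of 1 would reproduce strict stop-on-first-failure behavior, while a higher value approximates "stop only when every batch is failing", which Sean notes is otherwise hard to detect.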
[jira] [Updated] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app
[ https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6415: - Issue Type: Improvement (was: Bug) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app - Key: SPARK-6415 URL: https://issues.apache.org/jira/browse/SPARK-6415 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Hari Shreedharan Of course, this would have to be done as a configurable param, but such a fail-fast is useful; otherwise it is painful to figure out what is happening when there are cascading failures. In some cases, the SparkContext shuts down and streaming keeps scheduling jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6035) Unable to launch spark stream driver in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6035. -- Resolution: Not A Problem This looks like a problem specific to your setup on EC2. Something failed to start up. Unable to launch spark stream driver in cluster mode Key: SPARK-6035 URL: https://issues.apache.org/jira/browse/SPARK-6035 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.1 Reporter: pankaj Spark version: spark-1.2.1-bin-hadoop2.4. Getting an error while launching the driver from the master node in cluster mode; the driver gets launched on one of the slave nodes in the cluster. Issue: when launching from the master, the driver is launched on one of the slaves and tries to connect back to the master on port 0. In this case I am launching from the master, but even if I launch from some other slave node, it tries to connect to the node from which I launched it. Exception 2015-02-26 07:36:05 INFO SecurityManager:59 - Changing view acls to: root 2015-02-26 07:36:05 INFO SecurityManager:59 - Changing modify acls to: root 2015-02-26 07:36:05 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 2015-02-26 07:36:05 DEBUG AkkaUtils:63 - In createActorSystem, requireCookie is: off 2015-02-26 07:36:06 INFO Slf4jLogger:80 - Slf4jLogger started 2015-02-26 07:36:06 ERROR NettyTransport:65 - failed to bind to ec-node -where i run submit command-.compute-1.amazonaws.com/xx.xx.xx.xx:0, shutting down Netty transport 2015-02-26 07:36:06 ERROR Remoting:65 - Remoting error: [Startup failed] [ akka.remote.RemoteTransportException: Startup failed at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:136) at akka.remote.Remoting.start(Remoting.scala:201) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618) at
akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632) at akka.actor.ActorSystem$.apply(ActorSystem.scala:141) at akka.actor.ActorSystem$.apply(ActorSystem.scala:118) at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54) Thanks Pankaj -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource
[ https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6533: - Labels: (was: backport-needed) Allow using wildcard and other file pattern in Parquet DataSource - Key: SPARK-6533 URL: https://issues.apache.org/jira/browse/SPARK-6533 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.3.1 Reporter: Jianshi Huang Priority: Critical By default, spark.sql.parquet.useDataSourceApi is set to true, and loading parquet files using a file pattern will throw errors. *Wildcard* {noformat} scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*") 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. java.io.FileNotFoundException: File does not exist: hdfs://.../source=live/date=2014-06-0* at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) {noformat} And *[abc]* {noformat} scala> val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]") java.lang.IllegalArgumentException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI.create(URI.java:859) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267) at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:388) at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522) ...
49 elided Caused by: java.net.URISyntaxException: Illegal character in path at index 74: hdfs://.../source=live/date=2014-06-0[12] at java.net.URI$Parser.fail(URI.java:2829) at java.net.URI$Parser.checkChars(URI.java:3002) at java.net.URI$Parser.parseHierarchical(URI.java:3086) at java.net.URI$Parser.parse(URI.java:3034) at java.net.URI.<init>(URI.java:595) at java.net.URI.create(URI.java:857) {noformat} If spark.sql.parquet.useDataSourceApi is not enabled, we cannot have partition discovery, schema evolution, etc., but being able to specify file patterns is also very important to applications. Please add this important feature. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
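The two pattern styles Jianshi requests (`*` wildcard and `[...]` character class) follow ordinary glob semantics, as in Hadoop path globbing. A minimal Python sketch of how such patterns would select partition directories; the partition paths are illustrative, mirroring the ticket's layout:

```python
from fnmatch import fnmatchcase

# Hypothetical partition directories, mirroring the ticket's layout
partitions = [
    "source=live/date=2014-06-01",
    "source=live/date=2014-06-02",
    "source=live/date=2014-06-10",
]

# '*' wildcard: any date beginning with 2014-06-0 (excludes 2014-06-10)
wild = [p for p in partitions
        if fnmatchcase(p, "source=live/date=2014-06-0*")]

# '[12]' character class: only days 01 and 02
charclass = [p for p in partitions
             if fnmatchcase(p, "source=live/date=2014-06-0[12]")]
```

In Spark itself the equivalent expansion would come from the Hadoop FileSystem glob support rather than `fnmatch`; the sketch only shows which paths each pattern is expected to match.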
[jira] [Resolved] (SPARK-7476) Dynamic partitioning random behaviour
[ https://issues.apache.org/jira/browse/SPARK-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7476. -- Resolution: Invalid I think this is at best a question for user@. I don't think this relates to dynamic partition discovery if that's what you mean, nor is it random. Dynamic partitioning random behaviour - Key: SPARK-7476 URL: https://issues.apache.org/jira/browse/SPARK-7476 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: Spark SQL in standalone mode on CDH 5.3 Reporter: Eswara Reddy Adapa According to the documentation, Spark SQL 1.2 supports dynamic partitioning, but we see the behavior below. Expected output: !http://ibin.co/20zI4242Ur1h! Output seen: !http://ibin.co/20zILW5LQ5nT! It is generating only one partition (on a random value each time). Query: USE miah_ga; SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode = nonstrict; DROP TABLE sem_stg_tmp_part; CREATE TABLE sem_stg_tmp_part (chnl_nm string ,cmpgn_yr_nbr string ,cmpgn_qtr_nbr string ,actv_mo_nm string ,actv_wk_end_dt string ,actv_dt string ,seg_nm string ,bdgt_node_nm string ,ad_grp_nm string ,kywrd_txt string ,srch_engn_nm string ,engn_cmpgn_nm string ,publ_ctry_nm string ,publ_geo_cd string ,last_dest_url_txt string ,dvc_cat_nm string ,mdia_propty_nm string ,mdia_type_nm string ,audnc_nm string ,sub_aud_nm string ,prog_nm string ,org_intv_nm string ,sub_org_intv_nm string ,prd_nm string ,prim_engg_dest_nm string ,imprsn_cnt string ,click_cnt string ,tot_cost_amt string ,ctr_pct string ,cpc_amt string ,vist_cnt string ,paid_per_vist_cnt string ,paid_pir_vist_cnt string ,cost_per_intel_per_vist_amt string) partitioned by (ddate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; INSERT overwrite TABLE sem_stg_tmp_part PARTITION (ddate) SELECT *, concat(substr(actv_dt,1,2),substr(actv_dt,4,2),substr(actv_dt,7,4)) as ddate FROM miah_ga.sem_stg_tmp; -- This message was sent by Atlassian JIRA
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7665) MLlib Python API breaking changes check between 1.3 and 1.4
Yanbo Liang created SPARK-7665: -- Summary: MLlib Python API breaking changes check between 1.3 and 1.4 Key: SPARK-7665 URL: https://issues.apache.org/jira/browse/SPARK-7665 Project: Spark Issue Type: Documentation Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Compare the MLlib Python APIs between 1.3 and 1.4 so we can note breaking changes. We'll need to note those changes (if any) in the user guide's Migration Guide section. If the API change is for an Alpha/Experimental/DeveloperApi component, we should note that as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
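A breaking-changes check like this reduces to comparing the public API surface of the two versions: names removed between releases are potential breaking changes, names added need documenting. A rough Python sketch; the class names below are illustrative placeholders, not the actual 1.3/1.4 MLlib listings:

```python
def api_diff(old_api, new_api):
    """Compare two sets of public API names.
    Removals are potential breaking changes; additions need documenting."""
    removed = sorted(old_api - new_api)  # present before, gone now: breaking
    added = sorted(new_api - old_api)    # new names worth documenting
    return removed, added

# Illustrative API surfaces, not the real MLlib Python listings
v13 = {"KMeans", "NaiveBayes", "LinearRegressionWithSGD"}
v14 = {"KMeans", "NaiveBayes", "GaussianMixture"}
removed, added = api_diff(v13, v14)
```

In practice the name sets could be collected with `dir()` or the `inspect` module over the `pyspark.mllib` packages of each release, then diffed the same way.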
[jira] [Commented] (SPARK-7063) Update lz4 for Java 7 to avoid: when lz4 compression is used, it causes core dump
[ https://issues.apache.org/jira/browse/SPARK-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545368#comment-14545368 ] Tim Ellison commented on SPARK-7063: I can confirm that this failure is no longer seen using LZ4 1.3.0 with IBM Java 7+. Update lz4 for Java 7 to avoid: when lz4 compression is used, it causes core dump - Key: SPARK-7063 URL: https://issues.apache.org/jira/browse/SPARK-7063 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1 Environment: IBM JDK Reporter: Jenny MA Priority: Minor This issue was initially noticed using the IBM JDK; below please find the stack trace of this issue, caused by violating the rule in a critical section. #0 0x00314340f3cb in raise () from /service/pmrs/45638/20/lib64/libpthread.so.0 #1 0x7f795b0323be in j9dump_create () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9prt27.so #2 0x7f795a88ba2a in doSystemDump () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so #3 0x7f795b0405d5 in j9sig_protect () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9prt27.so #4 0x7f795a88a1fd in runDumpFunction () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so #5 0x7f795a88dbab in runDumpAgent () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so #6 0x7f795a8a1c49 in triggerDumpAgents () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so #7 0x7f795a4518fe in doTracePoint () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9trc27.so #8 0x7f795a45210e in j9Trace () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9trc27.so #9 0x7f79590e46e1 in MM_StandardAccessBarrier::jniReleasePrimitiveArrayCritical(J9VMThread*, _jarray*, void*, int) () from
/service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9gc27.so #10 0x7f7938bc397c in Java_net_jpountz_lz4_LZ4JNI_LZ4_1compress_1limitedOutput () from /service/pmrs/45638/20/tmp/liblz4-java7155003924599399415.so #11 0x7f795b707149 in VMprJavaSendNative () from /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9vm27.so #12 0x in ?? () This issue is introduced by a bug in net.jpountz.lz4.lz4-1.2.0.jar and fixed in version 1.3.0. Sun JDK / OpenJDK don't complain about this issue, but it triggers an assertion failure when the IBM JDK is used. Here is the link to the fix: https://github.com/jpountz/lz4-java/commit/07229aa2f788229ab4f50379308297f428e3d2d2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.
[ https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545305#comment-14545305 ] Apache Spark commented on SPARK-7664: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/6184 DAG visualization: Fix incorrect link paths of DAG. --- Key: SPARK-7664 URL: https://issues.apache.org/jira/browse/SPARK-7664 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Priority: Minor In JobPage, we can jump to a StagePage when we click the corresponding box of the DAG viz, but the link path is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.
[ https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7664: --- Assignee: (was: Apache Spark) DAG visualization: Fix incorrect link paths of DAG. --- Key: SPARK-7664 URL: https://issues.apache.org/jira/browse/SPARK-7664 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Priority: Minor In JobPage, we can jump to a StagePage when we click the corresponding box of the DAG viz, but the link path is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.
[ https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7664: --- Assignee: Apache Spark DAG visualization: Fix incorrect link paths of DAG. --- Key: SPARK-7664 URL: https://issues.apache.org/jira/browse/SPARK-7664 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Assignee: Apache Spark Priority: Minor In JobPage, we can jump to a StagePage when we click the corresponding box of the DAG viz, but the link path is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6973) The total stages on the allJobsPage is wrong
[ https://issues.apache.org/jira/browse/SPARK-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6973: - Component/s: (was: Spark Core) Web UI The total stages on the allJobsPage is wrong Key: SPARK-6973 URL: https://issues.apache.org/jira/browse/SPARK-6973 Project: Spark Issue Type: Bug Components: Web UI Reporter: meiyoula Priority: Minor Attachments: allJobs.png The job has two stages, a map stage and a collect stage, and both were retried twice. The first and second attempts of the map stage succeeded, and the third was skipped. For the collect stage, the first and second attempts failed, and the third succeeded. On the allJobs page, the number of total stages is allStages - skippedStages. Mostly that's right, but here I think the total should be 2. The example: Stage 0: Map Stage; Stage 1: Collect Stage. Stage: Stage 0 - Stage 1 - Stage 0 (retry 1) - Stage 1 (retry 1) - Stage 0 (retry 2) - Stage 1 (retry 2) Status: Success - Fail - Success - Fail - Skipped - Success Though one attempt of Stage 0 is skipped, the stage was actually executed, so I think it should be included in the total number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
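meiyoula's counting argument amounts to: count each distinct stage that executed at least once, rather than subtracting every skipped attempt. A small Python illustration of that proposed counting (a sketch, not the Web UI's actual code):

```python
def total_stages(attempts):
    """attempts: (stage_id, status) pairs across all retries.
    A stage skipped in one retry but run in another still counts once."""
    executed = {stage_id for stage_id, status in attempts
                if status != "skipped"}
    return len(executed)

# The report's example: map stage (0) and collect stage (1), three attempts each
attempts = [
    (0, "success"), (1, "fail"),
    (0, "success"), (1, "fail"),
    (0, "skipped"), (1, "success"),
]
count = total_stages(attempts)  # two distinct stages actually ran
```

Here `count` is 2, matching the total the reporter argues the allJobs page should display, whereas subtracting the skipped attempt from the attempt count gives a misleading figure.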
[jira] [Updated] (SPARK-7603) Crash of thrift server when doing SQL without limit
[ https://issues.apache.org/jira/browse/SPARK-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7603: - Component/s: (was: Spark Core) SQL Crash of thrift server when doing SQL without limit - Key: SPARK-7603 URL: https://issues.apache.org/jira/browse/SPARK-7603 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Environment: Hortonworks Sandbox 2.1 with Spark 1.3.1 Reporter: Ihor Bobak I have 2 tables in Hive: one with 120 thousand records, the other one 5 times smaller. I'm running a standalone cluster on a single VM, and the thrift server with the ./start-thriftserver.sh --conf spark.executor.memory=2048m --conf spark.driver.memory=1024m command. My spark-defaults.conf contains: spark.master spark://sandbox.hortonworks.com:7077 spark.eventLog.enabled true spark.eventLog.dir hdfs://sandbox.hortonworks.com:8020/user/pdi/spark/logs So, when I run the SQL select some fields from header, some fields from details from vw_salesorderdetail as d left join vw_salesorderheader as h on h.SalesOrderID = d.SalesOrderID limit 20; everything is fine, even though the limit exceeds what is needed (the result set returned is just 12 records).
But if I run the same query without the limit clause, the execution hangs - see here: http://postimg.org/image/fujdjd16f/42945a78/ - and there are a lot of exceptions in the thrift server logs: 15/05/13 17:59:27 INFO TaskSetManager: Starting task 158.0 in stage 48.0 (TID 953, sandbox.hortonworks.com, PROCESS_LOCAL, 1473 bytes) 15/05/13 18:00:01 INFO TaskSetManager: Finished task 150.0 in stage 48.0 (TID 945) in 36166 ms on sandbox.hortonworks.com (152/200) 15/05/13 18:00:02 ERROR Utils: Uncaught exception in thread Spark Context Cleaner java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:147) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:143) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) Exception in thread Spark Context Cleaner 15/05/13 18:00:02 ERROR Utils: Uncaught exception in thread task-result-getter-1 java.lang.OutOfMemoryError: GC overhead limit exceeded at java.lang.String.init(String.java:315) at com.esotericsoftware.kryo.io.Input.readAscii(Input.java:562) at com.esotericsoftware.kryo.io.Input.readString(Input.java:436) at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157) at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146) at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:706) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651) at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605) at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293) at
[jira] [Updated] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7042: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) I think this is an Akka / Scala problem really, but we can keep this open to track updating Akka at some point. Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Priority: Minor When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0, which ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with a more recent version of the Scala compiler (e.g. 2.11.6) - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo - update the version of akka used by spark (master and 1.3 branch) I would also suggest upgrading to the latest version of akka, 2.3.9 (or 2.3.10, which should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6973) The total stages on the allJobsPage is wrong
[ https://issues.apache.org/jira/browse/SPARK-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6973: - Priority: Minor (was: Major) The total stages on the allJobsPage is wrong Key: SPARK-6973 URL: https://issues.apache.org/jira/browse/SPARK-6973 Project: Spark Issue Type: Bug Components: Web UI Reporter: meiyoula Priority: Minor Attachments: allJobs.png The job has two stages, a map stage and a collect stage, and both were retried twice. The first and second attempts of the map stage succeeded, and the third was skipped. For the collect stage, the first and second attempts failed, and the third succeeded. On the allJobs page, the number of total stages is allStages - skippedStages. Mostly that's right, but here I think the total should be 2. The example: Stage 0: Map Stage; Stage 1: Collect Stage. Stage: Stage 0 - Stage 1 - Stage 0 (retry 1) - Stage 1 (retry 1) - Stage 0 (retry 2) - Stage 1 (retry 2) Status: Success - Fail - Success - Fail - Skipped - Success Though one attempt of Stage 0 is skipped, the stage was actually executed, so I think it should be included in the total number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container
[ https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545362#comment-14545362 ] Sean Owen commented on SPARK-6056: -- I can't make out whether this is an issue or not. Do you just need to allow for more off-heap memory in YARN? Unlimit offHeap memory use cause RM killing the container - Key: SPARK-6056 URL: https://issues.apache.org/jira/browse/SPARK-6056 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.1 Reporter: SaintBacchus Whether or not we set `preferDirectBufs` or limit the number of threads, Spark cannot limit the use of off-heap memory. At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final, Netty allocates an off-heap memory buffer of the same size as the heap buffer. So however many buffers you want to transfer, the same amount of off-heap memory will be allocated. But once the allocated memory reaches the overhead memory capacity set in YARN, the executor will be killed.
I wrote some simple code to test it: {code:title=test.scala|borderStyle=solid} import org.apache.spark.storage._ import org.apache.spark._ val bufferRdd = sc.makeRDD(0 to 10, 10).map(x => new Array[Byte](10*1024*1024)).persist bufferRdd.count val part = bufferRdd.partitions(0) val sparkEnv = SparkEnv.get val blockMgr = sparkEnv.blockManager def test = { val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index)) val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]] val len = resultIt.map(_.length).sum println(s"[${Thread.currentThread.getId}] get block length = $len") } def test_driver(count:Int, parallel:Int)(f: => Unit) = { val tpool = new scala.concurrent.forkjoin.ForkJoinPool(parallel) val taskSupport = new scala.collection.parallel.ForkJoinTaskSupport(tpool) val parseq = (1 to count).par parseq.tasksupport = taskSupport parseq.foreach(x => f) tpool.shutdown tpool.awaitTermination(100, java.util.concurrent.TimeUnit.SECONDS) } {code} Steps: 1. bin/spark-shell --master yarn-client --executor-cores 40 --num-executors 1 2. :load test.scala in spark-shell 3. use a command such as the following to watch the executor on the slave node {code} pid=$(jps|grep CoarseGrainedExecutorBackend |awk '{print $1}');top -b -p $pid|grep $pid {code} 4. test_driver(20,100)(test) in spark-shell 5. watch the output of the command on the slave node If you use multiple threads to get len, the physical memory will soon exceed the limit set by spark.yarn.executor.memoryOverhead -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7503) Resources in .sparkStaging directory can't be cleaned up on error
[ https://issues.apache.org/jira/browse/SPARK-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7503: - Assignee: Kousuke Saruta Resources in .sparkStaging directory can't be cleaned up on error - Key: SPARK-7503 URL: https://issues.apache.org/jira/browse/SPARK-7503 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.4.0 When we run applications on YARN in cluster mode, resources uploaded to the .sparkStaging directory can't be cleaned up when uploading local resources fails. You can reproduce this issue by running the following command. {code} bin/spark-submit --master yarn --deploy-mode cluster --class someClassName non-existing-jar {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7503) Resources in .sparkStaging directory can't be cleaned up on error
[ https://issues.apache.org/jira/browse/SPARK-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7503. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6026 [https://github.com/apache/spark/pull/6026] Resources in .sparkStaging directory can't be cleaned up on error - Key: SPARK-7503 URL: https://issues.apache.org/jira/browse/SPARK-7503 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Reporter: Kousuke Saruta Fix For: 1.4.0 When we run applications on YARN in cluster mode, resources uploaded to the .sparkStaging directory can't be cleaned up when uploading local resources fails. You can reproduce this issue by running the following command. {code} bin/spark-submit --master yarn --deploy-mode cluster --class someClassName non-existing-jar {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7664) Fix incorrect link paths of DAG.
Kousuke Saruta created SPARK-7664: - Summary: Fix incorrect link paths of DAG. Key: SPARK-7664 URL: https://issues.apache.org/jira/browse/SPARK-7664 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Priority: Minor In JobPage, we can jump to a StagePage by clicking the corresponding box in the DAG visualization, but the link path is incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7344) Spark hangs reading and writing to the same S3 bucket
[ https://issues.apache.org/jira/browse/SPARK-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545323#comment-14545323 ] Sean Owen commented on SPARK-7344: -- Yes, but the most recent script still runs with Hadoop 1.x code I think, and that has old S3 libs. I think this is an S3 client library problem, or at least, I would try a build with Hadoop 2.x bindings and later jets3t libs first. Spark hangs reading and writing to the same S3 bucket - Key: SPARK-7344 URL: https://issues.apache.org/jira/browse/SPARK-7344 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1 Environment: AWS EC2 Reporter: Daniel Mahler The following code will hang if the `outprefix` is in an S3 bucket:
{code}
def copy1 = "s3n://mybucket/copy1"
def copy2 = "s3n://mybucket/copy2"

val txt1 = sc.textFile(inpath)
txt1.count
val res = txt1.saveAsTextFile(copy1)

val txt2 = sc.textFile(copy1 + "/part-*")
txt2.count
txt2.saveAsTextFile(copy2) // <- HANGS HERE

val txt3 = sc.textFile(copy2 + "/part-*")
txt3.count
{code}
The problem goes away if copy1 and copy2 are in distinct S3 buckets, or when using HDFS instead of S3 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-7536: --- Description:
For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match? SPARK-7667
* Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118
** evaluation
*** MultilabelMetrics SPARK-6094
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263

was:
For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118
** evaluation
*** MultilabelMetrics SPARK-6094
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263

Audit MLlib Python API for 1.4 -- Key: SPARK-7536 URL: https://issues.apache.org/jira/browse/SPARK-7536 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yanbo Liang
For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match? SPARK-7667
* Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118
** evaluation
*** MultilabelMetrics SPARK-6094
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6527) sc.binaryFiles can not access files on s3
[ https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6527: - Component/s: EC2 Priority: Minor (was: Major) Is there any more detail on this, like stack traces or the code you're running? sc.binaryFiles can not access files on s3 - Key: SPARK-6527 URL: https://issues.apache.org/jira/browse/SPARK-6527 Project: Spark Issue Type: Bug Components: EC2, Input/Output Affects Versions: 1.2.0, 1.3.0 Environment: I am running Spark on EC2 Reporter: Zhao Zhang Priority: Minor The sc.binaryFiles() method can not access files stored on S3. It can correctly list the number of files, but reports that the files do not exist when processing them. I also tried sc.textFile(), which works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x
[ https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545375#comment-14545375 ] Konstantin Shaposhnikov commented on SPARK-7042: There is nothing wrong with the standard Akka 2.11 build. In fact we have a custom build of Spark now that uses standard Akka 2.3.9 from the maven central repository without any problems. The error appears only with the custom build of akka (because it was compiled with a buggy version of Scala) that comes with spark by default. I agree that the number of users affected by this problem is probably quite small (only 1? ;) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x -- Key: SPARK-7042 URL: https://issues.apache.org/jira/browse/SPARK-7042 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.1 Reporter: Konstantin Shaposhnikov Priority: Minor When connecting to a remote Spark cluster (that runs Spark branch-1.3 built with Scala 2.11) from an application that uses akka 2.3.9 I get the following error: {noformat} 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] [sparkDriver-akka.actor.default-dispatcher-5] - Association with remote system [akka.tcp://sparkExecutor@server:59007] has failed, address is now gated for [5000] ms. Reason is: [akka.actor.Identify; local class incompatible: stream classdesc serialVersionUID = -213377755528332889, local class serialVersionUID = 1]. {noformat} It looks like the akka-actor_2.11 2.3.4-spark that is used by Spark has been built using Scala compiler 2.11.0, which ignores SerialVersionUID annotations (see https://issues.scala-lang.org/browse/SI-8549). The following steps can resolve the issue: - re-build the custom akka library that is used by Spark with a more recent version of the Scala compiler (e.g. 2.11.6) - deploy a new version (e.g.
2.3.4.1-spark) to a maven repo - update the version of akka used by spark (master and the 1.3 branch) I would also suggest upgrading to the latest version of akka, 2.3.9 (or 2.3.10, which should be released soon). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
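The "local class incompatible" error above is standard Java serialization behavior: the serialVersionUID recorded in the stream must match the UID of the receiving class, and a compiler that drops the annotation effectively changes that UID. A small sketch (class and method names are illustrative, not from Spark or Akka) showing how an explicit serialVersionUID pins the serialized form:

```java
import java.io.*;

public class SerialDemo {
    // Pinning serialVersionUID keeps serialized forms compatible even if a
    // recompile would otherwise derive a different UID from the class shape.
    static class Msg implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;
        Msg(String text) { this.text = text; }
    }

    // Serialize an object to bytes and read it back.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(o);
            }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                return ois.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Msg back = (Msg) roundTrip(new Msg("hello"));
        System.out.println(back.text); // prints "hello"
    }
}
```

This is why a Scala compiler that ignores the @SerialVersionUID annotation (SI-8549) produces a class whose derived UID differs from the official build's, which is exactly the mismatch shown in the log.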
[jira] [Resolved] (SPARK-5271) PySpark History Web UI issues
[ https://issues.apache.org/jira/browse/SPARK-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5271. -- Resolution: Not A Problem PySpark History Web UI issues - Key: SPARK-5271 URL: https://issues.apache.org/jira/browse/SPARK-5271 Project: Spark Issue Type: Bug Components: PySpark, Web UI Affects Versions: 1.2.0 Environment: PySpark 1.2.0 in yarn-client mode Reporter: Andrey Zimovnov After successful run of PySpark app via spark-submit in yarn-client mode on Hadoop 2.4 cluster the History UI shows the same as in issue SPARK-3898. {code} App Name:Not Started Started:1970/01/01 07:59:59 Spark User:Not Started Last Updated:2014/10/10 14:50:39 Exception message: 2014-10-10 14:51:14,284 - ERROR - org.apache.spark.Logging$class.logError(Logging.scala:96) - qtp1594785497-16851 -Exception in parsing Spark event log hdfs://wscluster/sparklogs/24.3g_15_5g_2c-1412923684977/EVENT_LOG_1 org.json4s.package$MappingException: Did not find value which can be converted into int at org.json4s.reflect.package$.fail(package.scala:96) at org.json4s.Extraction$.convert(Extraction.scala:554) at org.json4s.Extraction$.extract(Extraction.scala:331) at org.json4s.Extraction$.extract(Extraction.scala:42) at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21) at org.apache.spark.util.JsonProtocol$.blockManagerIdFromJson(JsonProtocol.scala:647) at org.apache.spark.util.JsonProtocol$.blockManagerAddedFromJson(JsonProtocol.scala:468) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:404) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at 
org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$loadAppInfo(FsHistoryProvider.scala:181) at org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:99) at org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:55) at org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:53) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:88) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at
[jira] [Resolved] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master
[ https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5265. -- Resolution: Duplicate I think you described the same issue twice here; please close the old one if you're elaborating elsewhere. Submitting applications on Standalone cluster controlled by Zookeeper forces clients to know the active master -- Key: SPARK-5265 URL: https://issues.apache.org/jira/browse/SPARK-5265 Project: Spark Issue Type: Bug Components: Deploy Reporter: Roque Vassal'lo Labels: cluster, spark-submit, standalone, zookeeper Hi, this is my first JIRA here, so I hope it is clear enough. I'm using Spark 1.2.0 and trying to submit an application on a Spark Standalone cluster in cluster deploy mode with supervise. The Standalone cluster is running in high availability mode, using Zookeeper to provide leader election between three available Masters (named master1, master2 and master3). As described in Spark's documentation, to register a Worker with the Standalone cluster I provide the complete cluster info as the spark URL, i.e. spark://master1:7077,master2:7077,master3:7077. That URL is parsed and three attempts are launched: the first to master1:7077, the second to master2:7077 and the third to master3:7077. This works great! But if I try to do the same while submitting applications, it fails: if I provide the complete cluster info as the --master option to the spark-submit script, it throws an exception because it tries to connect as if it were a single node.
Example: spark-submit --class org.apache.spark.examples.SparkPi --master spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster --supervise examples.jar 100 This is the output I got: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mytest); users with modify permissions: Set(mytest) 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on port 53930. 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: spark://master1:7077,master2:7077,master3:7077 akka.actor.ActorInitializationException: exception during creation at akka.actor.ActorInitializationException$.apply(Actor.scala:164) at akka.actor.ActorCell.create(ActorCell.scala:596) at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456) at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478) at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.spark.SparkException: Invalid master URL: spark://master1:7077,master2:7077,master3:7077 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830) at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42) at akka.actor.Actor$class.aroundPreStart(Actor.scala:470) at 
org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35) at akka.actor.ActorCell.create(ActorCell.scala:580) ... 9 more Shouldn't it be parsed the same way as on Worker registration? That would not force the client to know which Master of the Standalone cluster is currently active. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
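A rough sketch of the parsing the reporter asks for: splitting the comma-separated master list into one URL per master, the way Worker registration treats it. The class and method names here are hypothetical, not Spark's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class MasterUrls {
    // Split "spark://h1:7077,h2:7077,h3:7077" into one spark:// URL per
    // listed master, so the client can try each in turn.
    static List<String> parse(String masterUrl) {
        if (!masterUrl.startsWith("spark://")) {
            throw new IllegalArgumentException("Invalid master URL: " + masterUrl);
        }
        List<String> urls = new ArrayList<>();
        for (String hostPort : masterUrl.substring("spark://".length()).split(",")) {
            urls.add("spark://" + hostPort);
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(parse("spark://master1:7077,master2:7077,master3:7077"));
        // [spark://master1:7077, spark://master2:7077, spark://master3:7077]
    }
}
```

With this kind of expansion, spark-submit could attempt each master in order instead of rejecting the whole string as a single invalid URL.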
[jira] [Resolved] (SPARK-5241) spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly
[ https://issues.apache.org/jira/browse/SPARK-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5241. -- Resolution: Invalid I don't understand the problem being reported here. Reopen if you can suggest a particular change. Maybe start by asking on user@ ? spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly -- Key: SPARK-5241 URL: https://issues.apache.org/jira/browse/SPARK-5241 Project: Spark Issue Type: Bug Components: Build, EC2 Reporter: Florian Verhein spark-ec2's spark/init.sh doesn't completely handle the hadoop dependencies. This may also be an issue for the tachyon dependencies. Related: tachyon appears to require builds against the right version of hadoop as well (probably causes this: SPARK-3185). This applies to the spark build from git checkout in spark/init.sh (I suspect this should also be changed to use mvn, as that's the reference build according to the docs?). It may apply to pre-built spark in spark/init.sh as well, but I'm not sure about this. E.g. I thought that the hadoop2.4 and cdh4.2 builds of spark are different. Also note that hadoop native is built from hadoop 2.4.1 on the AMI, and this is used regardless of HADOOP_MAJOR_VERSION in the *-hdfs modules. Tachyon is hard-coded to 0.4.1 (which is probably built against hadoop1.x?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545464#comment-14545464 ] Sean Owen commented on SPARK-4808: -- I think this is considered resolved now for 1.4 after https://github.com/apache/spark/commit/3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099 ? But not 1.3. Maybe [~andrewor14] can confirm. Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Spillable's maybeSpill does not allow a spill to occur until at least 1000 elements have been read, and then only evaluates spilling on every 32nd element thereafter. When a small number of very large items is being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior were intended to reduce the impact of the estimateSize() call. That method was extracted into SizeTracker, which implements its own exponential backoff for size estimation, so now we are only avoiding using the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
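The gating described in the report can be sketched as a simple predicate (the threshold and stride are taken from the description above; the name is hypothetical): with only a handful of very large elements tracked, the check never fires, so memory can be exhausted before any spill is considered.

```java
public class SpillGate {
    // Spilling is only evaluated after the first 1000 elements and then
    // on every 32nd element read, per the behavior described in the report.
    static boolean shouldCheckSpill(long elementsRead) {
        return elementsRead > 1000 && elementsRead % 32 == 0;
    }

    public static void main(String[] args) {
        // A handful of huge objects never triggers a spill check:
        System.out.println(shouldCheckSpill(5));    // false
        // Only well past the threshold, on a multiple of 32, does it fire:
        System.out.println(shouldCheckSpill(1024)); // true
    }
}
```

This illustrates the failure mode: five 1 GB objects would blow the heap long before the 1000-element threshold is ever reached.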
[jira] [Resolved] (SPARK-4560) Lambda deserialization error
[ https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4560. -- Resolution: Not A Problem Lambda deserialization error Key: SPARK-4560 URL: https://issues.apache.org/jira/browse/SPARK-4560 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.1.1 Environment: Java 8.0.25 Reporter: Alexis Seigneurin Attachments: IndexTweets.java, pom.xml I'm getting an error saying a lambda could not be deserialized. Here is the code:
{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
    .map(t -> t.getText())
    .foreachRDD(tweets -> {
        tweets.foreach(x -> System.out.println(x));
        return null;
    });
{code}
Here is the exception: {noformat} java.io.IOException: unexpected exception type at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104) ... 27 more Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization at com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1) ... 
37 more {noformat} The weird thing is, if I write the following code (the map operation is inside the foreachRDD), it works without a problem.
{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
    .foreachRDD(tweets -> {
        tweets.map(t -> t.getText())
              .foreach(x -> System.out.println(x));
        return null;
    });
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (SPARK-4556) binary distribution assembly can't run in local mode
[ https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4556: --- Assignee: Apache Spark binary distribution assembly can't run in local mode Key: SPARK-4556 URL: https://issues.apache.org/jira/browse/SPARK-4556 Project: Spark Issue Type: Bug Components: Build, Spark Shell Reporter: Sean Busbey Assignee: Apache Spark After building the binary distribution assembly, the resultant tarball can't be used for local mode. {code} busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package [INFO] Scanning for projects... ...SNIP... [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s] [INFO] Spark Project Networking ... SUCCESS [ 31.402 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 8.864 s] [INFO] Spark Project Core . SUCCESS [15:39 min] [INFO] Spark Project Bagel SUCCESS [ 29.470 s] [INFO] Spark Project GraphX ... SUCCESS [05:20 min] [INFO] Spark Project Streaming SUCCESS [11:02 min] [INFO] Spark Project Catalyst . SUCCESS [11:26 min] [INFO] Spark Project SQL .. SUCCESS [11:33 min] [INFO] Spark Project ML Library ... SUCCESS [14:27 min] [INFO] Spark Project Tools SUCCESS [ 40.980 s] [INFO] Spark Project Hive . SUCCESS [11:45 min] [INFO] Spark Project REPL . SUCCESS [03:15 min] [INFO] Spark Project Assembly . SUCCESS [04:22 min] [INFO] Spark Project External Twitter . SUCCESS [ 43.567 s] [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s] [INFO] Spark Project External Flume ... SUCCESS [01:41 min] [INFO] Spark Project External MQTT SUCCESS [ 40.973 s] [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s] [INFO] Spark Project External Kafka ... SUCCESS [01:23 min] [INFO] Spark Project Examples . 
SUCCESS [10:19 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 01:47 h [INFO] Finished at: 2014-11-22T02:13:51-06:00 [INFO] Final Memory: 79M/2759M [INFO] busbey2-MBA:spark busbey$ cd assembly/target/ busbey2-MBA:target busbey$ mkdir dist-temp busbey2-MBA:target busbey$ tar -C dist-temp -xzf spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz busbey2-MBA:target busbey$ cd dist-temp/ busbey2-MBA:dist-temp busbey$ ./bin/spark-shell ls: /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10: No such file or directory Failed to find Spark assembly in /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10 You need to build Spark before running this program. {code} It looks like the classpath calculations in {{bin/compute_classpath.sh}} don't handle it. If I move all of the spark-*.jar files from the top level into the lib folder and touch the RELEASE file, then the spark shell launches in local mode normally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3602) Can't run cassandra_inputformat.py
[ https://issues.apache.org/jira/browse/SPARK-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3602. -- Resolution: Not A Problem I think this is due to mismatching Hadoop libs, or at least is stale enough at this point that I think it should be closed. Can't run cassandra_inputformat.py -- Key: SPARK-3602 URL: https://issues.apache.org/jira/browse/SPARK-3602 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.1.0 Environment: Ubuntu 14.04 Reporter: Frens Jan Rumph When I execute: {noformat} wget http://apache.cs.uu.nl/dist/spark/spark-1.1.0/spark-1.1.0-bin-hadoop2.4.tgz tar xzf spark-1.1.0-bin-hadoop2.4.tgz cd spark-1.1.0-bin-hadoop2.4/ ./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop2.4.0.jar examples/src/main/python/cassandra_inputformat.py localhost keyspace cf {noformat} The output is: {noformat} Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/09/19 10:41:10 WARN Utils: Your hostname, laptop-x resolves to a loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0) 14/09/19 10:41:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/09/19 10:41:10 INFO SecurityManager: Changing view acls to: frens-jan, 14/09/19 10:41:10 INFO SecurityManager: Changing modify acls to: frens-jan, 14/09/19 10:41:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); users with modify permissions: Set(frens-jan, ) 14/09/19 10:41:11 INFO Slf4jLogger: Slf4jLogger started 14/09/19 10:41:11 INFO Remoting: Starting remoting 14/09/19 10:41:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@laptop-x.local:43790] 14/09/19 10:41:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@laptop-x.local:43790] 14/09/19 10:41:11 
INFO Utils: Successfully started service 'sparkDriver' on port 43790. 14/09/19 10:41:11 INFO SparkEnv: Registering MapOutputTracker 14/09/19 10:41:11 INFO SparkEnv: Registering BlockManagerMaster 14/09/19 10:41:11 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140919104111-145e 14/09/19 10:41:11 INFO Utils: Successfully started service 'Connection manager for block manager' on port 45408. 14/09/19 10:41:11 INFO ConnectionManager: Bound socket to port 45408 with id = ConnectionManagerId(laptop-x.local,45408) 14/09/19 10:41:11 INFO MemoryStore: MemoryStore started with capacity 265.4 MB 14/09/19 10:41:11 INFO BlockManagerMaster: Trying to register BlockManager 14/09/19 10:41:11 INFO BlockManagerMasterActor: Registering block manager laptop-x.local:45408 with 265.4 MB RAM 14/09/19 10:41:11 INFO BlockManagerMaster: Registered BlockManager 14/09/19 10:41:11 INFO HttpFileServer: HTTP File server directory is /tmp/spark-5f0289d7-9b20-4bd7-a713-db84c38c4eac 14/09/19 10:41:11 INFO HttpServer: Starting HTTP Server 14/09/19 10:41:11 INFO Utils: Successfully started service 'HTTP file server' on port 36556. 14/09/19 10:41:11 INFO Utils: Successfully started service 'SparkUI' on port 4040. 14/09/19 10:41:11 INFO SparkUI: Started SparkUI at http://laptop-frens-jan.local:4040 14/09/19 10:41:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 14/09/19 10:41:12 INFO SparkContext: Added JAR file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/lib/spark-examples-1.1.0-hadoop2.4.0.jar at http://192.168.2.2:36556/jars/spark-examples-1.1.0-hadoop2.4.0.jar with timestamp 146072417 14/09/19 10:41:12 INFO Utils: Copying /home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py to /tmp/spark-7dbb1b4d-016c-4f8b-858d-f79c9297f58f/cassandra_inputformat.py 14/09/19 10:41:12 INFO SparkContext: Added file file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py at http://192.168.2.2:36556/files/cassandra_inputformat.py with timestamp 146072419 14/09/19 10:41:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@laptop-frens-jan.local:43790/user/HeartbeatReceiver 14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with curMem=0, maxMem=278302556 14/09/19 10:41:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 163.7 KB, free 265.3 MB) 14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with curMem=167659, maxMem=278302556 14/09/19 10:41:12 INFO MemoryStore: Block broadcast_1 stored as values in memory
[jira] [Commented] (SPARK-2445) MesosExecutorBackend crashes in fine grained mode
[ https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545504#comment-14545504 ] Sean Owen commented on SPARK-2445: -- [~gbow...@fastmail.co.uk] are you saying that SPARK-3535 did actually resolve this? MesosExecutorBackend crashes in fine grained mode - Key: SPARK-2445 URL: https://issues.apache.org/jira/browse/SPARK-2445 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Dario Rexin When multiple instances of the MesosExecutorBackend are running on the same slave, they will have the same executorId assigned (equal to the Mesos slaveId) but a different, randomly assigned port. Because of this, the second instance cannot register a new BlockManager, because one is already registered with the same executorId but a different BlockManagerId. More description and a fix can be found in this PR on GitHub: https://github.com/apache/spark/pull/1358
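The collision described above is easy to see in miniature. The sketch below is hypothetical Python (the real BlockManagerMaster is Scala, and all names here are illustrative only): a registry keyed on executorId alone must refuse a second registration whose full BlockManagerId differs.

```python
# Hypothetical Python sketch (illustrative names only; the real code is
# Scala) of why two fine-grained executors on one slave collide: they share
# an executorId, so a registry keyed on executorId alone rejects the second.

class BlockManagerId:
    def __init__(self, executor_id, host, port):
        self.executor_id, self.host, self.port = executor_id, host, port

    def __eq__(self, other):
        return (self.executor_id, self.host, self.port) == \
               (other.executor_id, other.host, other.port)

class BlockManagerMaster:
    def __init__(self):
        self.registered = {}  # keyed by executor_id alone -- the problem

    def register(self, bm_id):
        existing = self.registered.get(bm_id.executor_id)
        if existing is not None and existing != bm_id:
            # Same executorId (the Mesos slaveId), different random port:
            # the second backend cannot register and crashes.
            raise RuntimeError("BlockManager already registered for "
                               + bm_id.executor_id)
        self.registered[bm_id.executor_id] = bm_id

master = BlockManagerMaster()
master.register(BlockManagerId("slave-1", "host-a", 45408))
try:
    master.register(BlockManagerId("slave-1", "host-a", 52311))
except RuntimeError as e:
    print(e)
```

Registering under the full (executorId, host, port) identity, or giving each backend a distinct executorId, removes the collision; that is the general direction a fix can take.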
[jira] [Resolved] (SPARK-1928) DAGScheduler suspended by local task OOM
[ https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1928. -- Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Peng Zhen Resolved long ago by https://github.com/apache/spark/pull/883 DAGScheduler suspended by local task OOM Key: SPARK-1928 URL: https://issues.apache.org/jira/browse/SPARK-1928 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 0.9.0 Reporter: Peng Zhen Assignee: Peng Zhen Fix For: 1.1.0 DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.
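The hang has a simple shape: a locally run task dies with an error that is never reported back, so the scheduler's blocking wait on the job result never returns. A hypothetical Python sketch (threads and a queue stand in for the DAGScheduler's event loop; all names are invented):

```python
# Hypothetical sketch of the hang: if a locally-run task dies without
# reporting its failure, the blocking wait on the result never returns.
import queue
import threading

def run_local_task(task, results, report_failures):
    try:
        results.put(("success", task()))
    except BaseException as e:  # MemoryError stands in for the JVM's OOM Error
        if report_failures:
            results.put(("failure", e))
        # otherwise the failure is dropped and the waiter never wakes up

def submit_and_wait(task, report_failures, timeout=1.0):
    results = queue.Queue()
    t = threading.Thread(target=run_local_task,
                         args=(task, results, report_failures))
    t.start()
    try:
        return results.get(timeout=timeout)  # the scheduler's blocking wait
    except queue.Empty:
        return ("hung", None)

def oom_task():
    raise MemoryError("simulated local task OOM")

print(submit_and_wait(oom_task, report_failures=False)[0])  # hung
print(submit_and_wait(oom_task, report_failures=True)[0])   # failure
```

The fix in the linked PR follows the second branch's spirit: report every throwable back to the waiter instead of letting it wait forever.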
[jira] [Resolved] (SPARK-2133) FileNotFoundException in BlockObjectWriter
[ https://issues.apache.org/jira/browse/SPARK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2133. -- Resolution: Cannot Reproduce FileNotFoundException in BlockObjectWriter -- Key: SPARK-2133 URL: https://issues.apache.org/jira/browse/SPARK-2133 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: YARN Reporter: Neville Li Seeing a lot of this when running ALS on large data sets ( 50GB) and YARN. The job eventually fails after spark.task.maxFailures has been reached. {code} Exception in thread Thread-3 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:677 failed 10 times, most recent failure: Exception failure in TID 946 on host lon4-hadoopslave-b501.lon4.spotify.net: java.io.FileNotFoundException: /disk/hd01/yarn/local/usercache/neville/appcache/application_1401944843353_36952/spark-local-20140611033053-6b18/2a/shuffle_0_677_985 (No such file or directory) java.io.FileOutputStream.openAppend(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:192) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177) org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161) org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) java.lang.Thread.run(Thread.java:662) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code}
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545533#comment-14545533 ] Guillaume E.B. commented on SPARK-4105: --- I think I hit the bug using another compression codec. I will try to reproduce it as soon as I can. FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Attachments: JavaObjectToSerialize.java, SparkFailedToUncompressGenerator.scala We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Here's another occurrence of a similar error:
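The failure mode above is easy to reproduce in spirit with any checksummed compression codec. The sketch below uses Python's zlib rather than snappy-java, but makes the same point: once compressed shuffle bytes are corrupted or interleaved in flight, decompression fails with a hard error analogous to {{FAILED_TO_UNCOMPRESS(5)}} instead of returning wrong data.

```python
# Same point as FAILED_TO_UNCOMPRESS(5), demonstrated with Python's zlib in
# place of snappy-java: corrupted compressed bytes fail decompression hard.
import zlib

payload = b"shuffle block payload " * 100
compressed = bytearray(zlib.compress(payload))

# An intact stream round-trips fine.
assert zlib.decompress(bytes(compressed)) == payload

# Flip one byte mid-stream, as corrupted or interleaved output would.
compressed[15] ^= 0xFF
try:
    zlib.decompress(bytes(compressed))
except zlib.error as e:
    print("decompression failed:", e)
```

This is also why the buffer-sharing bug in SPARK-7660 usually surfaces as stream-corruption errors like this one rather than as silently wrong answers.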
[jira] [Commented] (SPARK-5220) keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver
[ https://issues.apache.org/jira/browse/SPARK-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545456#comment-14545456 ] Sean Owen commented on SPARK-5220: -- [~superxma] is this resolved then? keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver - Key: SPARK-5220 URL: https://issues.apache.org/jira/browse/SPARK-5220 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Max Xu I am running a Spark streaming application with ReliableKafkaReceiver. It uses BlockGenerator to push blocks to BlockManager. However, writing WALs to HDFS may time out that causes keepPushingBlocks in BlockGenerator to terminate. 15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:176) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:160) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:126) at org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124) at org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:207) at org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:275) 
at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:181) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:154) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:86) Then the block pushing thread is done and no subsequent blocks can be pushed into the BlockManager, which in turn blocks the receiver from receiving new data. So when the TimeoutException happens while running my app, the ReliableKafkaReceiver stays in ACTIVE status but doesn't do anything at all. The application effectively hangs.
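The core problem is that one thrown exception permanently ends the pushing loop. A hypothetical Python sketch (not the actual BlockGenerator code or its eventual fix) contrasts the give-up-on-first-error loop with one that records the error and keeps draining blocks:

```python
# Hypothetical sketch (not the real BlockGenerator): contrast the loop that
# dies on the first push error with one that records the error and keeps
# draining, so one WAL timeout cannot stop the receiver forever.
import queue

def keep_pushing_blocks(blocks, push, stop_on_error):
    pushed, errors = [], []
    while True:
        try:
            block = blocks.get_nowait()
        except queue.Empty:
            return pushed, errors
        try:
            push(block)
            pushed.append(block)
        except TimeoutError as e:  # analogue of the 30s WAL write timeout
            errors.append((block, e))
            if stop_on_error:      # reported behaviour: the thread terminates
                return pushed, errors

def flaky_push(block):
    if block == "block-2":
        raise TimeoutError("Futures timed out after [30 seconds]")

def filled_queue():
    q = queue.Queue()
    for b in ("block-1", "block-2", "block-3"):
        q.put(b)
    return q

print(keep_pushing_blocks(filled_queue(), flaky_push, stop_on_error=True)[0])
print(keep_pushing_blocks(filled_queue(), flaky_push, stop_on_error=False)[0])
```

Whether a real fix should retry, drop, or fail the receiver outright is a policy question; the sketch only shows why terminating the thread silently is the worst of the options.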
[jira] [Assigned] (SPARK-5175) bug in updating counters when starting multiple workers/supervisors in actor-based receiver
[ https://issues.apache.org/jira/browse/SPARK-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5175: --- Assignee: (was: Apache Spark) bug in updating counters when starting multiple workers/supervisors in actor-based receiver --- Key: SPARK-5175 URL: https://issues.apache.org/jira/browse/SPARK-5175 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Nan Zhu When starting multiple workers (ActorReceiver.scala), we didn't update the counters in it.
[jira] [Assigned] (SPARK-5174) Missing Document for starting multiple workers/supervisors in actor-based receiver
[ https://issues.apache.org/jira/browse/SPARK-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5174: --- Assignee: (was: Apache Spark) Missing Document for starting multiple workers/supervisors in actor-based receiver -- Key: SPARK-5174 URL: https://issues.apache.org/jira/browse/SPARK-5174 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Nan Zhu Priority: Minor Currently, the documentation about starting multiple supervisors/workers is missing, though the implementation provides this capability:
{code:title=ActorReceiver.scala|borderStyle=solid}
case props: Props =>
  val worker = context.actorOf(props)
  logInfo("Started receiver worker at:" + worker.path)
  sender ! worker
case (props: Props, name: String) =>
  val worker = context.actorOf(props, name)
  logInfo("Started receiver worker at:" + worker.path)
  sender ! worker
case _: PossiblyHarmful =>
  hiccups.incrementAndGet()
case _: Statistics =>
  val workers = context.children
  sender ! Statistics(n.get, workers.size, hiccups.get, workers.mkString("\n"))
{code}
[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4
[ https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-7536: --- Description: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python. ** classification *** StreamingLogisticRegressionWithSGD SPARK-7633 ** clustering *** GaussianMixture SPARK-6258 *** LDA SPARK-6259 *** Power Iteration Clustering SPARK-5962 *** StreamingKMeans SPARK-4118 ** evaluation *** MultilabelMetrics SPARK-6094 ** feature *** ElementwiseProduct SPARK-7605 *** PCA SPARK-7604 ** linalg *** Distributed linear algebra SPARK-6100 ** pmml.export SPARK-7638 ** regression *** StreamingLinearRegressionWithSGD SPARK-4127 ** stat *** KernelDensity SPARK-7639 ** util *** MLUtils SPARK-6263 was: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python. ** classification *** StreamingLogisticRegressionWithSGD SPARK-7633 ** clustering *** GaussianMixture SPARK-6258 *** LDA SPARK-6259 *** Power Iteration Clustering SPARK-5962 *** StreamingKMeans SPARK-4118 ** evaluation *** MultilabelMetrics SPARK-6094 ** feature *** ElementwiseProduct SPARK-7605 *** PCA SPARK-7604 ** linalg *** Distributed linear algebra SPARK-6100 ** pmml.export SPARK-7638 ** regression *** StreamingLinearRegressionWithSGD SPARK-4127 ** stat *** KernelDensity SPARK-7639 ** util *** MLUtils SPARK-6263 Audit MLlib Python API for 1.4 -- Key: SPARK-7536 URL: https://issues.apache.org/jira/browse/SPARK-7536 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yanbo Liang For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala and Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. SPARK-7665 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python.
** classification *** StreamingLogisticRegressionWithSGD SPARK-7633 ** clustering *** GaussianMixture SPARK-6258 *** LDA SPARK-6259 *** Power Iteration Clustering SPARK-5962 *** StreamingKMeans SPARK-4118 ** evaluation *** MultilabelMetrics SPARK-6094 ** feature *** ElementwiseProduct SPARK-7605 *** PCA SPARK-7604 ** linalg *** Distributed linear algebra SPARK-6100 ** pmml.export SPARK-7638 ** regression *** StreamingLinearRegressionWithSGD SPARK-4127 ** stat *** KernelDensity SPARK-7639 ** util *** MLUtils SPARK-6263
[jira] [Created] (SPARK-7667) MLlib Python API consistency check
Yanbo Liang created SPARK-7667: -- Summary: MLlib Python API consistency check Key: SPARK-7667 URL: https://issues.apache.org/jira/browse/SPARK-7667 Project: Spark Issue Type: Task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Check and ensure that the MLlib Python API (classes/methods/parameters) is consistent with Scala.
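Part of such a consistency check can be automated. A rough, hypothetical sketch: compare the public method names of two classes (stand-ins for a Scala class's API surface and its PySpark wrapper) and report anything present on only one side.

```python
# Rough, hypothetical sketch of an automatable consistency check: compare
# the public method names of two classes (stand-ins for a Scala class's
# surface and its PySpark wrapper) and report names on only one side.
import inspect

class ScalaSideKMeans:       # stand-in for the Scala API's method list
    def fit(self, data): ...
    def setK(self, k): ...
    def setSeed(self, seed): ...

class PythonSideKMeans:      # stand-in for the PySpark wrapper
    def fit(self, data): ...
    def setK(self, k): ...   # setSeed is missing -> an inconsistency

def public_methods(cls):
    return {name for name, _ in inspect.getmembers(cls, inspect.isfunction)
            if not name.startswith("_")}

missing = public_methods(ScalaSideKMeans) - public_methods(PythonSideKMeans)
print(sorted(missing))  # ['setSeed']
```

In practice the Scala side would come from the generated HTML doc or reflection over the JVM classes rather than a Python stand-in, but the set difference is the core of the check.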
[jira] [Updated] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4598: - Issue Type: Improvement (was: Bug) Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: meiyoula On the HistoryServer stage page, clicking the task link in the Description column triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded
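The proposed improvement amounts to rendering one slice of the task table per request instead of materializing rows for all 100,000 tasks in a single response. A minimal sketch (hypothetical helper, not Spark's UI code):

```python
# Minimal pagination sketch (hypothetical helper, not Spark's UI code):
# render one slice of the task table per request instead of building rows
# for every task in a single response.
def task_page(tasks, page, page_size=100):
    """Return the 1-indexed `page` of `tasks`, at most `page_size` rows."""
    start = (page - 1) * page_size
    return tasks[start:start + page_size]

tasks = ["task-%d" % i for i in range(100000)]
page = task_page(tasks, page=3, page_size=100)
print(page[0], page[-1])  # task-200 task-299
```

The memory win is that each HTTP response allocates rows proportional to the page size, not to the stage's task count.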
[jira] [Updated] (SPARK-1910) Add onBlockComplete API to receiver
[ https://issues.apache.org/jira/browse/SPARK-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1910: - Issue Type: Improvement (was: Bug) Add onBlockComplete API to receiver --- Key: SPARK-1910 URL: https://issues.apache.org/jira/browse/SPARK-1910 Project: Spark Issue Type: Improvement Components: Block Manager Reporter: Hari Shreedharan This would allow the receiver to ACK all data that has already been successfully stored by the block generator. This means the receiver's store methods must now receive the block Id, so the receiver can recognize which events are the ones that have been stored.
[jira] [Updated] (SPARK-1107) Add shutdown hook on executor stop to stop running tasks
[ https://issues.apache.org/jira/browse/SPARK-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1107: - Issue Type: Improvement (was: Bug) We have a shutdown hook that stops the SparkContext, which is kind of related. Add shutdown hook on executor stop to stop running tasks Key: SPARK-1107 URL: https://issues.apache.org/jira/browse/SPARK-1107 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Andrew Ash Originally reported by aash: http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA%2B-p3AHXYhpjXH9fr8jQ5%2B_gc%3DNHjLbOiJB9bHSahfEET5aHBQ%40mail.gmail.com%3E Latest in thread: http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA+-p3AFi7vz=2oty3caa0g+5ekg+a84uvqrl9tgstvgwgyb...@mail.gmail.com%3E The most popular approach is to add a shutdown hook that stops running tasks in the executors.
[jira] [Resolved] (SPARK-604) reconnect if mesos slaves dies
[ https://issues.apache.org/jira/browse/SPARK-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-604. - Resolution: Cannot Reproduce Stale at this point, without similar findings recently. reconnect if mesos slaves dies -- Key: SPARK-604 URL: https://issues.apache.org/jira/browse/SPARK-604 Project: Spark Issue Type: Bug Components: Mesos When running on Mesos, if a slave goes down, Spark doesn't try to reassign the work to another machine. Even if the slave comes back up, the job is doomed. Currently when this happens, we just see this in the driver logs: 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: Mesos slave lost: 201210312057-1560611338-5050-24091-52 Exception in thread Thread-346 java.util.NoSuchElementException: key not found: value: 201210312057-1560611338-5050-24091-52 at scala.collection.MapLike$class.default(MapLike.scala:224) at scala.collection.mutable.HashMap.default(HashMap.scala:43) at scala.collection.MapLike$class.apply(MapLike.scala:135) at scala.collection.mutable.HashMap.apply(HashMap.scala:43) at spark.scheduler.cluster.ClusterScheduler.slaveLost(ClusterScheduler.scala:255) at spark.scheduler.mesos.MesosSchedulerBackend.slaveLost(MesosSchedulerBackend.scala:275) 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: driver.run() returned with code DRIVER_ABORTED
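The stack trace shows the classic unguarded-map-lookup pattern: ClusterScheduler.slaveLost indexes a HashMap with the lost slaveId, and Scala's Map.apply throws when the key is absent. The Python analogue below shows the failing form and the defensive alternative:

```python
# The crash pattern: an unguarded map lookup. Scala's map(key) throws
# NoSuchElementException on a miss; the Python analogue raises KeyError.
# The defensive form treats an unknown slave as having nothing to clean up.
executors_by_slave = {"slave-a": ["exec-1"]}

lost = "201210312057-1560611338-5050-24091-52"  # a slaveId we never tracked

try:
    executors_by_slave[lost]  # like Scala's map.apply: raises on a miss
except KeyError as e:
    print("key not found:", e)

# Defensive alternative: nothing to clean up for an unknown slave.
print(executors_by_slave.get(lost, []))  # []
```

In Scala the equivalent defensive form is `map.get(key)` returning an `Option`, or `map.getOrElse(key, default)`.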
[jira] [Commented] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url
[ https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545441#comment-14545441 ] Sean Owen commented on SPARK-5331: -- [~florianverhein] is this an issue then or just a matter of setting the config correctly? Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url --- Key: SPARK-5331 URL: https://issues.apache.org/jira/browse/SPARK-5331 Project: Spark Issue Type: Bug Components: EC2 Environment: Running on EC2 via modified spark-ec2 scripts (to get dependencies right so tachyon starts) Using tachyon 0.5.0 built against hadoop 2.4.1 Spark 1.2.0 built against tachyon 0.5.0 and hadoop 0.4.1 Tachyon configured using the template in 0.5.0 but updated with slave list and master variables etc.. Reporter: Florian Verhein ps -ef | grep Tachyon shows Tachyon running on the master (and the slave) node with correct setting: -Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com However from stderr log on worker running the SparkTachyonPi example: 15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it 15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998 15/01/20 06:00:56 ERROR : Failed to connect (1) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:00:57 ERROR : Failed to connect (2) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:00:58 ERROR : Failed to connect (3) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:00:59 ERROR : Failed to connect (4) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:01:00 ERROR : Failed to connect (5) to master localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused 15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir null failed 
{code}
java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts
        at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
        at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
        at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
        at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
        at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
        at org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57)
        at org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
        at org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
        at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
        at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts
        at tachyon.master.MasterClient.connect(MasterClient.java:178)
        at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
        ... 28 more
Caused by:
{code}
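Sean Owen's question suggests the workaround, if this is purely a configuration issue: point the workers at the Tachyon master explicitly instead of the {{tachyon://localhost:19998}} default. A hedged sketch of a conf/spark-defaults.conf entry (the hostname placeholder stands for whatever the spark-ec2 master's DNS name is; whether spark-ec2 should set this automatically is exactly what this ticket is about):

{code}
# conf/spark-defaults.conf on each worker (hostname is a placeholder):
spark.tachyonStore.url  tachyon://<tachyon-master-hostname>:19998
{code}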
[jira] [Resolved] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve
[ https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5246.
--
Resolution: Done
Assignee: Vladimir Grigor
(Really, was fixed by a PR for mesos)

spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve
--
Key: SPARK-5246
URL: https://issues.apache.org/jira/browse/SPARK-5246
Project: Spark
Issue Type: Bug
Components: EC2
Reporter: Vladimir Grigor
Assignee: Vladimir Grigor

How to reproduce:
1) Following http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html should be sufficient to set up a VPC for this bug. After you have followed that guide, start a new instance in the VPC and ssh to it (through the NAT server).
2) The user starts a cluster in the VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript
Setting up security groups...
(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
failed to launch org.apache.spark.deploy.master.Master:
        at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
        ... 12 more
full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
10.1.1.62:      at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
10.1.1.62:      ... 12 more
10.1.1.62: full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
[timing] spark-standalone setup: 00h 00m 28s
(omitted for brevity)
{code}

/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out:
{code}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 8080
15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, HUP, INT]
Exception in thread "main" java.net.UnknownHostException: ip-10-1-1-151: ip-10-1-1-151: Name or service not known
        at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
        at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
        at org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
        at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
        at org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
        at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
        at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
        at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
        at org.apache.spark.deploy.master.MasterArguments.init(MasterArguments.scala:27)
        at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
        at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
        at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
        ... 12 more
{code}

The problem is that an instance launched in a VPC may not be able to resolve its own local hostname. Please see https://forums.aws.amazon.com/thread.jspa?threadID=92092. I am going to submit a fix for this problem since I need this functionality asap.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
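A common workaround for this UnknownHostException, independent of the spark-ec2 fix, is to set the SPARK_LOCAL_IP environment variable so Spark skips hostname resolution entirely. A minimal Python sketch of that lookup order, assuming only that SPARK_LOCAL_IP takes precedence (the loopback fallback at the end is this sketch's choice for illustration, not Spark's behavior):

```python
import os
import socket

def resolve_local_ip() -> str:
    """Resolve this host's IP address, with an escape hatch for hosts
    (e.g. EC2 VPC instances) whose own hostname does not resolve."""
    # SPARK_LOCAL_IP, when set, bypasses hostname resolution entirely.
    override = os.environ.get("SPARK_LOCAL_IP")
    if override:
        return override
    try:
        # This is the step that throws (UnknownHostException in the JVM)
        # when the VPC instance cannot resolve its own hostname.
        return socket.gethostbyname(socket.gethostname())
    except socket.gaierror:
        # Illustrative fallback only: return loopback rather than crash.
        return "127.0.0.1"
```

Exporting SPARK_LOCAL_IP=10.1.1.151 on the master in the reporter's setup would make the lookup succeed without any DNS resolution.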
[jira] [Resolved] (SPARK-3942) LogisticRegressionWithLBFGS should not use SquaredL2Updater
[ https://issues.apache.org/jira/browse/SPARK-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3942.
--
Resolution: Won't Fix

LogisticRegressionWithLBFGS should not use SquaredL2Updater
--
Key: SPARK-3942
URL: https://issues.apache.org/jira/browse/SPARK-3942
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.1.0
Reporter: fuminglin

The LBFGS method uses a line search to choose the step size, but all of MLlib's updaters decrease the step size with the square root of the number of iterations; this may cause the Wolfe conditions not to hold.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
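For context, a short statement of the standard Wolfe conditions the reporter is invoking: for iterate $x_k$, descent direction $p_k$, and step size $\alpha_k$, a line search requires

```latex
% Sufficient-decrease (Armijo) and curvature conditions,
% with constants 0 < c_1 < c_2 < 1:
f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^{\top} p_k
\nabla f(x_k + \alpha_k p_k)^{\top} p_k \ge c_2 \nabla f(x_k)^{\top} p_k
```

A fixed schedule of the form $\alpha_k \propto 1/\sqrt{k}$, as used by the MLlib updaters, is chosen independently of these inequalities, so nothing guarantees they hold at each step.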
[jira] [Resolved] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
[ https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3967.
--
Resolution: Fixed
Fix Version/s: 1.2.0
Assignee: Christophe Préaud

Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions
--
Key: SPARK-3967
URL: https://issues.apache.org/jira/browse/SPARK-3967
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 1.1.0
Reporter: Christophe Préaud
Assignee: Christophe Préaud
Fix For: 1.2.0
Attachments: spark-1.1.0-utils-fetch.patch, spark-1.1.0-yarn_cluster_tmpdir.patch

Spark applications fail from time to time in yarn-cluster mode (but not in yarn-client mode) when yarn.nodemanager.local-dirs (the Hadoop YARN config) is set to a comma-separated list of directories which are located on different disks/partitions.

Steps to reproduce:
1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of directories located on different partitions (the more you set, the more likely it will be to reproduce the bug):
{code}
(...)
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
</property>
(...)
{code}
2. Launch an application in yarn-cluster mode several times; it will fail (apparently at random) from time to time.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)