[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544987#comment-14544987
 ] 

Josh Rosen commented on SPARK-7660:
---

Note that this affects more than just Spark 1.4.0; I'll trace back and figure 
out the complete list of affected versions tomorrow, but I think that any 
version that relied on a snappy-java release published after mid-June or July 
2014 may be affected.

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice.
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.
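
To make the trigger condition above concrete, here is a minimal sketch (illustrative only, not taken from the Spark test suite) of the double-close pattern on a snappy-java SnappyOutputStream that the upstream report describes:

{code}
import java.io.ByteArrayOutputStream
import org.xerial.snappy.SnappyOutputStream

object DoubleCloseSketch {
  def main(args: Array[String]): Unit = {
    val sink = new ByteArrayOutputStream()
    val snappyOut = new SnappyOutputStream(sink)
    snappyOut.write("some shuffle output".getBytes("UTF-8"))
    snappyOut.close()
    // Second close: in affected snappy-java versions this hands the stream's
    // internal input/output buffers back to the recycling pool a second time,
    // so two SnappyOutputStreams created later can be given the same buffer
    // and silently overwrite each other's data.
    snappyOut.close()
  }
}
{code}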






[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545005#comment-14545005
 ] 

Josh Rosen commented on SPARK-7660:
---

I pushed 
https://github.com/apache/spark/commit/7da33ce5057ff965eec19ce662465b64a3564019 
as a hotfix, which masks the bug in a way that fixes the JavaAPISuite Jenkins 
failures.  We'll still fix this bug before 1.4, but in the meantime the hotfix 
will make it easier to recognize new Jenkins failures.

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.






[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-7660:
-

Assignee: Josh Rosen

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.






[jira] [Commented] (SPARK-7662) Exception of multi-attribute generator analysis in projection

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545072#comment-14545072
 ] 

Apache Spark commented on SPARK-7662:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6178

 Exception of multi-attribute generator analysis in projection
 

 Key: SPARK-7662
 URL: https://issues.apache.org/jira/browse/SPARK-7662
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 {code}
 select explode(map(value, key)) from src;
 {code}
 It throws exception like
 {panel}
 org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
 AS clause does not match the number of columns output by the UDTF expected 2 
 aliases but got _c0 ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 {panel}
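
For context, a minimal way to reproduce the failure above (a sketch only; it assumes a Hive table {{src(key STRING, value STRING)}} like the one used in Spark SQL's Hive tests, with local-mode settings chosen purely for illustration):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object ExplodeMapSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "explode-map-sketch")
    val hiveContext = new HiveContext(sc)
    // explode(map(...)) is a UDTF that produces two output columns (the map's
    // key and value), but the projection only generates a single alias (_c0),
    // which trips the alias-count check and raises the AnalysisException above.
    hiveContext.sql("select explode(map(value, key)) from src").collect()
  }
}
{code}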






[jira] [Assigned] (SPARK-7662) Exception of multi-attribute generator analysis in projection

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7662:
---

Assignee: Apache Spark

 Exception of multi-attribute generator analysis in projection
 

 Key: SPARK-7662
 URL: https://issues.apache.org/jira/browse/SPARK-7662
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark
Priority: Blocker

 {code}
 select explode(map(value, key)) from src;
 {code}
 It throws exception like
 {panel}
 org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
 AS clause does not match the number of columns output by the UDTF expected 2 
 aliases but got _c0 ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 {panel}






[jira] [Assigned] (SPARK-7662) Exception of multi-attribute generator analysis in projection

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7662:
---

Assignee: (was: Apache Spark)

 Exception of multi-attribute generator analysis in projection
 

 Key: SPARK-7662
 URL: https://issues.apache.org/jira/browse/SPARK-7662
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 {code}
 select explode(map(value, key)) from src;
 {code}
 It throws exception like
 {panel}
 org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
 AS clause does not match the number of columns output by the UDTF expected 2 
 aliases but got _c0 ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 {panel}






[jira] [Created] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-7660:
-

 Summary: Snappy-java buffer-sharing bug leads to data corruption / 
test failures
 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker


snappy-java contains a bug that can lead to situations where separate 
SnappyOutputStream instances end up sharing the same input and output buffers, 
which can lead to data corruption issues.  See 
https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and 
https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue.

I discovered this issue because the buffer-sharing was leading to a test 
failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
the wrong answer because both tasks wrote their output using the same 
compression buffers and one task won the race, causing its output to be written 
to both shuffle output files. As a result, the test returned the result of 
collecting one partition twice.

The buffer-sharing can only occur if {{close()}} is called twice on the same 
SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a 
more precise description of when this issue may occur, see my upstream 
tickets).  I think that this double-close happens somewhere in some test code 
that was added as part of my Tungsten shuffle patch, exposing this bug (to see 
this, download a recent build of master and run 
https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
force the test execution order that triggers the bug).

I think that it's rare that this bug would lead to silent failures like this. 
In more realistic workloads that aren't writing only a handful of bytes per 
task, I would expect this issue to lead to stream corruption issues like 
SPARK-4105.






[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-05-15 Thread Alex (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545027#comment-14545027
 ] 

Alex commented on SPARK-2344:
-

Hi,

How are you? I have a couple of questions:

1) When are you planning to submit the FCM to the main Spark branch? (I'm
interested in working on top of it for Feature Weight FCM improvements.)

2) Is there a way for Spark to distribute an RDD based on input data
columns rather than rows?

Thanks,
Alex


 Add Fuzzy C-Means algorithm to MLlib
 

 Key: SPARK-2344
 URL: https://issues.apache.org/jira/browse/SPARK-2344
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Alex
Priority: Minor
  Labels: clustering
   Original Estimate: 1m
  Remaining Estimate: 1m

 I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
 FCM is very similar to K-Means, which is already implemented; they differ 
 only in the degree of relationship each point has with each cluster 
 (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
 As part of the implementation I would like to:
 - create a base class for K-Means and FCM
 - implement the relationship for each algorithm differently (in its own class)
 I'd like this to be assigned to me.






[jira] [Commented] (SPARK-6747) Support List as a return type in Hive UDF

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545077#comment-14545077
 ] 

Apache Spark commented on SPARK-6747:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/6179

 Support List as a return type in Hive UDF
 ---

 Key: SPARK-6747
 URL: https://issues.apache.org/jira/browse/SPARK-6747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takeshi Yamamuro

 The current implementation can't handle List as a return type in a Hive UDF.
 Assume a UDF like the one below:
 public class UDFToListString extends UDF {
   public List<String> evaluate(Object o) {
     return Arrays.asList("xxx", "yyy", "zzz");
   }
 }
 An exception of scala.MatchError is thrown as follows when the UDF is used:
 scala.MatchError: interface java.util.List (of class java.lang.Class)
   at 
 org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
   at 
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
   at 
 scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
 ...
 To fix this problem, we need to add an entry for List in 
 HiveInspectors#javaClassToDataType.
 However, this has one difficulty because of type erasure in the JVM.
 Assume the lines below are appended to HiveInspectors#javaClassToDataType:
 // list type
 case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
   val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
   println(tpe.getActualTypeArguments()(0).toString()) // prints 'E'
 This logic fails to recover the component type of the List.
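
A tiny standalone sketch of the erasure problem (illustrative only; {{ErasureSketch}} is not part of Spark):

{code}
import java.lang.reflect.ParameterizedType

object ErasureSketch {
  def main(args: Array[String]): Unit = {
    val c: Class[_] = classOf[java.util.List[_]]
    // java.util.List extends Collection<E>, so the only generic type argument
    // visible on the Class at runtime is the unresolved type variable E;
    // the concrete element type (e.g. String) has been erased.
    val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
    println(tpe.getActualTypeArguments()(0)) // prints "E"
  }
}
{code}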






[jira] [Updated] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7660:
--
Description: 
snappy-java contains a bug that can lead to situations where separate 
SnappyOutputStream instances end up sharing the same input and output buffers, 
which can lead to data corruption issues.  See 
https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and 
https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue.

I discovered this issue because the buffer-sharing was leading to a test 
failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
the wrong answer because both tasks wrote their output using the same 
compression buffers and one task won the race, causing its output to be written 
to both shuffle output files. As a result, the test returned the result of 
collecting one partition twice (see 
https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
details).

The buffer-sharing can only occur if {{close()}} is called twice on the same 
SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a 
more precise description of when this issue may occur, see my upstream 
tickets).  I think that this double-close happens somewhere in some test code 
that was added as part of my Tungsten shuffle patch, exposing this bug (to see 
this, download a recent build of master and run 
https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
force the test execution order that triggers the bug).

I think that it's rare that this bug would lead to silent failures like this. 
In more realistic workloads that aren't writing only a handful of bytes per 
task, I would expect this issue to lead to stream corruption issues like 
SPARK-4105.

  was:
snappy-java contains a bug that can lead to situations where separate 
SnappyOutputStream instances end up sharing the same input and output buffers, 
which can lead to data corruption issues.  See 
https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and 
https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue.

I discovered this issue because the buffer-sharing was leading to a test 
failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
the wrong answer because both tasks wrote their output using the same 
compression buffers and one task won the race, causing its output to be written 
to both shuffle output files. As a result, the test returned the result of 
collecting one partition twice.

The buffer-sharing can only occur if {{close()}} is called twice on the same 
SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a 
more precise description of when this issue may occur, see my upstream 
tickets).  I think that this double-close happens somewhere in some test code 
that was added as part of my Tungsten shuffle patch, exposing this bug (to see 
this, download a recent build of master and run 
https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
force the test execution order that triggers the bug).

I think that it's rare that this bug would lead to silent failures like this. 
In more realistic workloads that aren't writing only a handful of bytes per 
task, I would expect this issue to lead to stream corruption issues like 
SPARK-4105.


 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream tickets).

[jira] [Updated] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6258:
-
Fix Version/s: 1.4.0

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
 Fix For: 1.4.0


 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Resolved] (SPARK-7591) FSBasedRelation interface tweaks

2015-05-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-7591.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6150
[https://github.com/apache/spark/pull/6150]

 FSBasedRelation interface tweaks
 

 Key: SPARK-7591
 URL: https://issues.apache.org/jira/browse/SPARK-7591
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.4.0


 # Renaming {{FSBasedRelation}} to {{HadoopFsRelation}}
   Since it's all coupled with the Hadoop {{FileSystem}} and job APIs.
 # {{HadoopFsRelation}} should have a no-arg constructor
   {{paths}} and {{partitionColumns}} should just be methods to be overridden, 
 rather than constructor arguments. This makes data source developers' lives 
 easier by having a no-arg constructor and being serialization friendly.
 # Renaming {{HadoopFsRelation.prepareForWrite}} to 
 {{HadoopFsRelation.prepareJobForWrite}}
   The new name explicitly suggests developers should only touch the {{Job}} 
 instance for preparation work (which is also documented in Scaladoc).
 # Allowing serialization while creating {{OutputWriter}}s
   To be more precise, {{OutputWriter}}s are never created on driver side and 
 serialized to executor side. But the factory that creates {{OutputWriter}}s 
 should be created on driver side and serialized.
   The reason behind this is that passing all needed materials to 
 {{OutputWriter}} instances via the Hadoop Configuration is doable but sometimes 
 neither intuitive nor convenient. Resorting to serialization makes data 
 source developers' lives easier. This actually came up when I was migrating 
 the Parquet data source and wanted to pass the final output path (instead of 
 the temporary work path) to the output writer (see 
 [here|https://github.com/liancheng/spark/commit/ec9950c591e5b981ce20fab96562db28488e0035#diff-53521d336f7259e859fea4d3ca4dc888R74]).
  There I had to put a property into the Configuration object.






[jira] [Commented] (SPARK-7621) Report KafkaReceiver MessageHandler errors so StreamingListeners can take action

2015-05-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545105#comment-14545105
 ] 

Saisai Shao commented on SPARK-7621:


Hi [~jerluc], you could submit a related PR on GitHub; the Spark community 
submits patches on GitHub rather than attaching them to JIRA.

 Report KafkaReceiver MessageHandler errors so StreamingListeners can take 
 action
 

 Key: SPARK-7621
 URL: https://issues.apache.org/jira/browse/SPARK-7621
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0, 1.3.1
Reporter: Jeremy A. Lucas
 Fix For: 1.3.1

 Attachments: SPARK-7621.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, when a MessageHandler (for any of the Kafka Receiver 
 implementations) encounters an error handling a message, the error is only 
 logged with:
 {code:none}
 case e: Exception => logError("Error handling message", e)
 {code}
 It would be _incredibly_ useful to be able to notify any registered 
 StreamingListener of this receiver error (especially since this 
 {{try...catch}} block masks more fatal Kafka connection exceptions).
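
One possible direction (a sketch only, shown with a hypothetical custom receiver rather than the real KafkaReceiver code) is to surface handler failures through {{Receiver.reportError}}, which is what ultimately reaches {{StreamingListener.onReceiverError}}:

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical stand-in for a Kafka message-handling loop.
class SketchReceiver(messages: Seq[String])
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    new Thread("sketch-receiver") {
      override def run(): Unit = {
        messages.foreach { msg =>
          try {
            store(msg)
          } catch {
            case e: Exception =>
              // Instead of only calling logError: report the failure to the
              // driver, where registered StreamingListeners see onReceiverError.
              reportError("Error handling message", e)
          }
        }
      }
    }.start()
  }

  override def onStop(): Unit = {}
}
{code}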






[jira] [Commented] (SPARK-7269) Incorrect aggregation analysis

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544979#comment-14544979
 ] 

Apache Spark commented on SPARK-7269:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/6173

 Incorrect aggregation analysis
 --

 Key: SPARK-7269
 URL: https://issues.apache.org/jira/browse/SPARK-7269
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 In a case-insensitive analyzer (HiveContext), capitalization differences in 
 attribute names will fail the analysis check for aggregation.
 {code}
 test("check analysis failed in case in-sensitive") {
   Seq(1, 2, 3).map(i => (i, i.toString)).toDF("key", "value")
     .registerTempTable("df_analysis")
   sql("SELECT kEy from df_analysis group by key")
 }
 {code}
 {noformat}
 expression 'kEy' is neither present in the group by, nor is it an aggregate 
 function. Add to group by or wrap in first() if you don't care which value 
 you get.;
 org.apache.spark.sql.AnalysisException: expression 'kEy' is neither present 
 in the group by, nor is it an aggregate function. Add to group by or wrap in 
 first() if you don't care which value you get.;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:39)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:85)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:101)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:89)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:39)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1121)
   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:97)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply$mcV$sp(SQLQuerySuite.scala:408)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 {noformat}






[jira] [Resolved] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-6258.
--
Resolution: Fixed

Issue resolved by pull request 6087
[https://github.com/apache/spark/pull/6087]

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Commented] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545017#comment-14545017
 ] 

Apache Spark commented on SPARK-7654:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6175

 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern for write.
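
As a rough sketch of the intended usage (assuming the builder lands with methods along the lines of format/mode/save; the names here are illustrative):

{code}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Chain the output options on a writer object instead of passing an
// ever-growing list of arguments to save(...). Assumes `df` already exists.
def writeAsParquet(df: DataFrame, path: String): Unit = {
  df.write
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .save(path)
}
{code}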






[jira] [Created] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming

2015-05-15 Thread Murtaza Kanchwala (JIRA)
Murtaza Kanchwala created SPARK-7661:


 Summary: Support for dynamic allocation of executors in Kinesis 
Spark Streaming
 Key: SPARK-7661
 URL: https://issues.apache.org/jira/browse/SPARK-7661
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.3.1
 Environment: AWS-EMR
Reporter: Murtaza Kanchwala


Currently the logic for the number of executors is (N + 1), where N is the 
number of shards in a Kinesis stream.

My requirement is that if I use this resharding util for Amazon Kinesis:

Amazon Kinesis Resharding: 
https://github.com/awslabs/amazon-kinesis-scaling-utils

then there should be some way to allocate executors based on the number of 
shards directly (for Spark Streaming only).






[jira] [Assigned] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7654:
---

Assignee: Apache Spark  (was: Reynold Xin)

 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern for write.






[jira] [Assigned] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7654:
---

Assignee: Reynold Xin  (was: Apache Spark)

 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern for write.






[jira] [Assigned] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7651:
---

Assignee: Apache Spark

 PySpark GMM predict, predictSoft should fail on bad input
 -

 Key: SPARK-7651
 URL: https://issues.apache.org/jira/browse/SPARK-7651
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Minor

 In PySpark, GaussianMixtureModel predict and predictSoft test if the argument 
 is an RDD and operate correctly if so.  But if the argument is not an RDD, 
 they fail silently, returning nothing.
 [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176]
 Instead, they should raise errors.






[jira] [Assigned] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7651:
---

Assignee: (was: Apache Spark)

 PySpark GMM predict, predictSoft should fail on bad input
 -

 Key: SPARK-7651
 URL: https://issues.apache.org/jira/browse/SPARK-7651
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Joseph K. Bradley
Priority: Minor

 In PySpark, GaussianMixtureModel predict and predictSoft test if the argument 
 is an RDD and operate correctly if so.  But if the argument is not an RDD, 
 they fail silently, returning nothing.
 [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176]
 Instead, they should raise errors.






[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545126#comment-14545126
 ] 

Apache Spark commented on SPARK-7651:
-

User 'FlytxtRnD' has created a pull request for this issue:
https://github.com/apache/spark/pull/6180

 PySpark GMM predict, predictSoft should fail on bad input
 -

 Key: SPARK-7651
 URL: https://issues.apache.org/jira/browse/SPARK-7651
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Joseph K. Bradley
Priority: Minor

 In PySpark, GaussianMixtureModel predict and predictSoft test if the argument 
 is an RDD and operate correctly if so.  But if the argument is not an RDD, 
 they fail silently, returning nothing.
 [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176]
 Instead, they should raise errors.






[jira] [Assigned] (SPARK-7586) User guide update for spark.ml Word2Vec

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7586:
---

Assignee: Apache Spark  (was: Xusen Yin)

 User guide update for spark.ml Word2Vec
 ---

 Key: SPARK-7586
 URL: https://issues.apache.org/jira/browse/SPARK-7586
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.






[jira] [Assigned] (SPARK-7586) User guide update for spark.ml Word2Vec

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7586:
---

Assignee: Xusen Yin  (was: Apache Spark)

 User guide update for spark.ml Word2Vec
 ---

 Key: SPARK-7586
 URL: https://issues.apache.org/jira/browse/SPARK-7586
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Xusen Yin

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.






[jira] [Commented] (SPARK-7586) User guide update for spark.ml Word2Vec

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545153#comment-14545153
 ] 

Apache Spark commented on SPARK-7586:
-

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6181

 User guide update for spark.ml Word2Vec
 ---

 Key: SPARK-7586
 URL: https://issues.apache.org/jira/browse/SPARK-7586
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Xusen Yin

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.






[jira] [Updated] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7663:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Yeah it should be an error in any event. It's just a question of whether you 
want a different error. You could `require` a non-empty iterator with an 
appropriate error message instead.
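
A minimal sketch of that suggestion (the names below are illustrative, not the actual Word2Vec internals):

{code}
// Fail fast with an actionable message instead of letting `.head` throw
// "next on empty iterator" when minCount filters out the entire vocabulary.
def firstVectorSize(vectors: Iterator[Array[Float]], minCount: Int): Int = {
  require(vectors.hasNext,
    s"The vocabulary is empty: minCount ($minCount) may exceed every word's frequency.")
  vectors.next().length
}
{code}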

 [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size 
 is zero
 -

 Key: SPARK-7663
 URL: https://issues.apache.org/jira/browse/SPARK-7663
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
Priority: Minor
 Fix For: 1.4.1


 mllib.feature.Word2Vec at line 442: 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442
  uses `.head` to get the vector size. But it would throw an empty iterator 
 error if the `minCount` is large enough to remove all words in the dataset.
 But since this is not a common scenario, maybe we can ignore it. If so, 
 we can close the issue directly. If not, I can add some code to print a 
 more helpful error message.






[jira] [Commented] (SPARK-7566) HiveContext.analyzer cannot be overriden

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545066#comment-14545066
 ] 

Apache Spark commented on SPARK-7566:
-

User 'smola' has created a pull request for this issue:
https://github.com/apache/spark/pull/6177

 HiveContext.analyzer cannot be overriden
 

 Key: SPARK-7566
 URL: https://issues.apache.org/jira/browse/SPARK-7566
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
Assignee: Santiago M. Mola
 Fix For: 1.4.0


 Trying to override HiveContext.analyzer will give the following compilation 
 error:
 {code}
 Error:(51, 36) overriding lazy value analyzer in class HiveContext of type 
 org.apache.spark.sql.catalyst.analysis.Analyzer{val extendedResolutionRules: 
 List[org.apache.spark.sql.catalyst.rules.Rule[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]]};
  lazy value analyzer has incompatible type
   override protected[sql] lazy val analyzer: Analyzer = {
^
 {code}
 That is because the inferred type changed inadvertently when the explicit 
 declaration of the return type was omitted.
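
The same failure can be reproduced with a small standalone example (hypothetical classes; in HiveContext the inferred type is an anonymous refinement of Analyzer rather than a named subclass, but the mechanism is the same):

{code}
class Base
class WithExtras extends Base { val extendedResolutionRules: List[Int] = Nil }

class Parent {
  // No explicit type annotation, so the member's type is inferred as the more
  // specific WithExtras rather than Base.
  protected lazy val analyzerLike = new WithExtras
}

class Child extends Parent {
  // Uncommenting this fails to compile with "overriding lazy value analyzerLike
  // in class Parent of type WithExtras; lazy value analyzerLike has incompatible type".
  // override protected lazy val analyzerLike: Base = new Base
}
{code}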






[jira] [Created] (SPARK-7662) Exception of multi-attribute generator analysis in projection

2015-05-15 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-7662:


 Summary: Exception of multi-attribute generator analysis in 
projection
 Key: SPARK-7662
 URL: https://issues.apache.org/jira/browse/SPARK-7662
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker


{code}
select explode(map(value, key)) from src;
{code}

It throws exception like
{panel}
org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
AS clause does not match the number of columns output by the UDTF expected 2 
aliases but got _c0 ;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
{panel}






[jira] [Updated] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7663:
-
Fix Version/s: (was: 1.4.1)

(Don't set Fix Version please: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark )

 [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size 
 is zero
 -

 Key: SPARK-7663
 URL: https://issues.apache.org/jira/browse/SPARK-7663
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
Priority: Minor

 mllib.feature.Word2Vec at line 442: 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442
  uses `.head` to get the vector size. But it would throw an empty iterator 
 error if the `minCount` is large enough to remove all words in the dataset.
 But since this is not a common scenario, maybe we can ignore it. If so, 
 we can close the issue directly. If not, I can add some code to print a 
 more helpful error message.






[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545006#comment-14545006
 ] 

Josh Rosen commented on SPARK-4105:
---

I've opened SPARK-7660 to track progress on the fix for the snappy-java buffer 
sharing bug.

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Attachments: JavaObjectToSerialize.java, 
 SparkFailedToUncompressGenerator.scala


 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Here's another occurrence of a similar error:
 {code}
 

[jira] [Updated] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7654:
---
Summary: DataFrameReader and DataFrameWriter for input/output API  (was: 
Create builder pattern for DataFrame.save)

 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7654:
---
Description: 
We have a proliferation of save options now. It'd make more sense to have a 
builder pattern for write.




  was:
We have a proliferation of save options now. It'd make more sense to have a 
builder pattern instead.



 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern for write.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7660:
---

Assignee: Apache Spark

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Assignee: Apache Spark
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545031#comment-14545031
 ] 

Apache Spark commented on SPARK-7660:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6176

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7660:
---

Assignee: (was: Apache Spark)

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545014#comment-14545014
 ] 

Josh Rosen commented on SPARK-7660:
---

If we're wary of upgrading to a new Snappy version and don't want to wait for a 
new release / backport, one option is to just wrap SnappyOutputStream with our 
own code to make close() idempotent.  I don't think that this will have any 
significant overhead if done right, since the JIT should be able to inline the 
SnappyOutputStream calls.
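
For illustration, a minimal sketch of such a wrapper, assuming all we need is an 
idempotent close(); class and method names here are illustrative, not the actual 
Spark change:

{code}
import java.io.OutputStream

// Minimal sketch (not the actual Spark patch): delegate everything, but make
// close() a no-op after the first call, so a double-close can never hand the
// underlying Snappy buffers back to the buffer pool twice.
class IdempotentCloseOutputStream(out: OutputStream) extends OutputStream {
  private[this] var closed = false
  override def write(b: Int): Unit = out.write(b)
  override def write(b: Array[Byte], off: Int, len: Int): Unit = out.write(b, off, len)
  override def flush(): Unit = out.flush()
  override def close(): Unit = {
    if (!closed) {
      closed = true
      out.close()
    }
  }
}
{code}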

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero

2015-05-15 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-7663:


 Summary: [MLLIB] feature.Word2Vec throws empty iterator error when 
the vocabulary size is zero
 Key: SPARK-7663
 URL: https://issues.apache.org/jira/browse/SPARK-7663
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
 Fix For: 1.4.1


mllib.feature.Word2Vec at line 442: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442
 uses `.head` to get the vector size, but it throws an empty-iterator error if 
`minCount` is large enough to remove all words in the dataset.

Since this is not a common scenario, maybe we can ignore it; if so, we can close 
this issue directly. If not, I can add some code to print a more helpful error 
message.
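
A minimal sketch of the kind of guard that could replace the bare `.head` call; 
the collection name below is hypothetical and only stands in for whatever 
per-word structure Word2Vec builds after filtering by `minCount`:

{code}
// Hypothetical guard before calling .head: `vectors` stands in for the per-word
// collection that remains after words below minCount have been removed.
def vectorSizeOf(vectors: Map[String, Array[Float]], minCount: Int): Int = {
  require(vectors.nonEmpty,
    s"Word2Vec vocabulary is empty; all words were filtered out by minCount=$minCount")
  vectors.head._2.length
}
{code}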



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7227) Support fillna / dropna in R DataFrame

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7227:
---

Assignee: Apache Spark  (was: Sun Rui)

 Support fillna / dropna in R DataFrame
 --

 Key: SPARK-7227
 URL: https://issues.apache.org/jira/browse/SPARK-7227
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Reynold Xin
Assignee: Apache Spark
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7227) Support fillna / dropna in R DataFrame

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7227:
---

Assignee: Sun Rui  (was: Apache Spark)

 Support fillna / dropna in R DataFrame
 --

 Key: SPARK-7227
 URL: https://issues.apache.org/jira/browse/SPARK-7227
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Reynold Xin
Assignee: Sun Rui
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7227) Support fillna / dropna in R DataFrame

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545300#comment-14545300
 ] 

Apache Spark commented on SPARK-7227:
-

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/6183

 Support fillna / dropna in R DataFrame
 --

 Key: SPARK-7227
 URL: https://issues.apache.org/jira/browse/SPARK-7227
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Reynold Xin
Assignee: Sun Rui
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7657) [YARN] Show driver link in Spark UI

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7657:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

 [YARN] Show driver link in Spark UI
 ---

 Key: SPARK-7657
 URL: https://issues.apache.org/jira/browse/SPARK-7657
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.4.0
Reporter: Hari Shreedharan
Priority: Minor

 Currently, the driver link does not show up in the application UI. It is 
 painful to debug apps running in cluster mode if the link does not show up. 
 Client mode is fine since the links are local to the client machine.
 In YARN mode, it is possible to just get this from the YARN container report. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6499) pyspark: printSchema command on a dataframe hangs

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545355#comment-14545355
 ] 

Sean Owen commented on SPARK-6499:
--

I can't reproduce this. Are you sure it still happens? What version is this on, and does it still happen on master?

 pyspark: printSchema command on a dataframe hangs
 -

 Key: SPARK-6499
 URL: https://issues.apache.org/jira/browse/SPARK-6499
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: cynepia
 Attachments: airports.json, pyspark.txt


 1. A printSchema() call on a DataFrame fails to respond even after a long time.
 Console logs will be attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6399:
-
Issue Type: Improvement  (was: Bug)

 Code compiled against 1.3.0 may not run against older Spark versions
 

 Key: SPARK-6399
 URL: https://issues.apache.org/jira/browse/SPARK-6399
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin

 Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
 easier to use. The problem is that scalac now generates code that will not 
 run on older Spark versions if those conversions are used.
 Basically, even if you explicitly import {{SparkContext._}}, scalac will 
 generate references to the new methods in the {{RDD}} object instead. So the 
 compiled code will reference code that doesn't exist in older versions of 
 Spark.
 You can work around this by explicitly calling the methods in the 
 {{SparkContext}} object, although that's a little ugly.
 We should at least document this limitation (if there's no way to fix it), 
 since I believe forwards compatibility in the API was also a goal.
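 A minimal sketch of that workaround, calling the conversion defined on the 
 {{SparkContext}} companion object explicitly (illustrative only; concrete key 
 and value types are used so the implicit ClassTags resolve):
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.RDD

 // Invoke the conversion on the SparkContext object explicitly, so the compiled
 // bytecode does not reference the RDD-object methods that only exist in 1.3.0+.
 def wordCount(lines: RDD[String]): RDD[(String, Int)] = {
   val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
   SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _)
 }
 {code}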



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6287) Add support for dynamic allocation in the Mesos coarse-grained scheduler

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6287:
-
Issue Type: Improvement  (was: Bug)

 Add support for dynamic allocation in the Mesos coarse-grained scheduler
 

 Key: SPARK-6287
 URL: https://issues.apache.org/jira/browse/SPARK-6287
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Iulian Dragos

 Add support inside the coarse-grained Mesos scheduler for dynamic allocation. 
 It amounts to implementing two methods that allow scaling up and down the 
 number of executors:
 {code}
 def doKillExecutors(executorIds: Seq[String])
 def doRequestTotalExecutors(requestedTotal: Int)
 {code}
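 A hedged sketch of the shape such an implementation could take (illustrative 
 only: the field names are hypothetical, and the Boolean return values follow 
 the corresponding CoarseGrainedSchedulerBackend hooks):
 {code}
 // Illustrative shape only: keep a target executor count and a set of executors
 // marked for removal, and have the two hooks just update that state so later
 // Mesos offers can be accepted or declined accordingly.
 trait DynamicAllocationHooks {
   @volatile protected var executorLimit: Int = Int.MaxValue
   protected val pendingRemoves = scala.collection.mutable.Set.empty[String]

   def doRequestTotalExecutors(requestedTotal: Int): Boolean = {
     executorLimit = requestedTotal
     true
   }

   def doKillExecutors(executorIds: Seq[String]): Boolean = {
     pendingRemoves ++= executorIds
     true
   }
 }
 {code}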



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7336) Sometimes the status of a finished job shown on the JobHistory UI stays active and never updates.

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7336:
-
Priority: Minor  (was: Major)

 Sometimes the status of a finished job shown on the JobHistory UI stays 
 active and never updates.
 

 Key: SPARK-7336
 URL: https://issues.apache.org/jira/browse/SPARK-7336
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: ShaoChuan
Priority: Minor

 When I run a SparkPi job, the status of the job on the JobHistory UI is 
 'active'. Long after the job has finished, the status on the JobHistory UI 
 still never updates, and the job stays in the 'Incomplete applications' list. 
 This problem appears occasionally, and the JobHistory configuration is left at 
 its default values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6520) Kryo serialization broken in the shell

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6520.
--
Resolution: Won't Fix

Yes, I think this is a function of how {{:paste}}d code is evaluated and how 
that interacts with what Kryo expects. I don't know that it's realistic to 
expect that to change; spark-shell is just quite different in how classes are 
defined on the fly. You can run a compiled program instead, or paste your class 
definitions separately first if you have to.
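
For example, a minimal illustration of that workaround, reusing the {{Example}} 
class from the report (illustrative only; {{sc}} is the shell's SparkContext):

{code}
// Paste (or compile) the class definition on its own first:
case class Example(foo: String, bar: String)

// Then, in a separate statement outside that paste block:
val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect()
{code}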

 Kryo serialization broken in the shell
 --

 Key: SPARK-6520
 URL: https://issues.apache.org/jira/browse/SPARK-6520
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0
Reporter: Aaron Defazio

 If I start spark as follows:
 {quote}
 ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer
 {quote}
 Then using :paste, run 
 {quote}
 case class Example(foo : String, bar : String)
 val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", 
 "bar2"))).collect()
 {quote}
 I get the error:
 {quote}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.io.IOException: 
 com.esotericsoftware.kryo.KryoException: Error constructing instance of 
 class: $line3.$read
 Serialization trace:
 $VAL10 ($iwC)
 $outer ($iwC$$iwC)
 $outer ($iwC$$iwC$Example)
   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
   at 
 org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 {quote}
 As far as I can tell, when using :paste, Kryo serialization doesn't work for 
 classes defined within the same paste. It does work when the statements are 
 entered without paste.
 This issue seems serious to me, since Kryo serialization is virtually 
 mandatory for performance (20x slower with default serialization on my 
 problem), and I'm assuming feature parity between spark-shell and 
 spark-submit is a goal.
 Note that this is different from SPARK-6497, which covers the case when Kryo 
 is set to require registration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5711) Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5711.
--
Resolution: Not A Problem

I'm not sure this qualifies as a bug. You're just saying that processing a lot 
of data took a long time, and that time was spent somewhere. If you have a 
specific suggestion about how to set the size of this map more intelligently to 
avoid growing/rehashing, we can reopen.
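
For reference, a hedged sketch of what such a change could amount to; 
AppendOnlyMap is a private[spark] class, so this only compiles from inside 
Spark itself, and the capacity value is arbitrary:

{code}
import org.apache.spark.util.collection.AppendOnlyMap

// AppendOnlyMap already takes an initialCapacity argument (default 64); the
// suggestion above amounts to threading a larger, workload-appropriate value
// through from the shuffle code instead of always starting at the default.
val map = new AppendOnlyMap[String, Long](initialCapacity = 1 << 20)
{code}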

 Sort Shuffle performance issues about using AppendOnlyMap for large data sets
 -

 Key: SPARK-5711
 URL: https://issues.apache.org/jira/browse/SPARK-5711
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
 Environment: hbase-0.98.6-cdh5.2.0 phoenix-4.2.2
Reporter: Sun Fulin

 Recently we hit performance issues when using Spark 1.2.0 to read data from 
 HBase and do some summary work.
 Our scenario is to read large data sets from HBase (maybe a 100 GB+ file), 
 form an HBase RDD, transform it to a SchemaRDD, then group by and aggregate 
 the data into a few smaller summary data sets and load them into HBase 
 (Phoenix).
 Our major issue is that aggregating the large data sets into summary data sets 
 takes far too long (1 hour+), which is much worse performance than should be 
 expected. We attached the dump file; the jstack stacktrace looks like the 
 following.
 From the stacktrace and dump file we can see that processing large data sets 
 causes the AppendOnlyMap to grow frequently, leading to a huge map entry size. 
 We looked at the source code of 
 org.apache.spark.util.collection.AppendOnlyMap and found that the map is 
 initialized with a capacity of 64. That is too small for our use case.
 Thread 22432: (state = IN_JAVA)
 - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, 
 line=224 (Compiled frame; information may be imprecise)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() 
 @bci=1, line=38 (Interpreted frame)
 - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, 
 line=198 (Compiled frame)
 - 
 org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, 
 scala.Function2) @bci=201, line=145 (Compiled frame)
 - 
 org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
  scala.Function2) @bci=3, line=32 (Compiled frame)
 - 
 org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
  @bci=141, line=205 (Compiled frame)
 - 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
  @bci=74, line=58 (Interpreted frame)
 - 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
  @bci=169, line=68 (Interpreted frame)
 - 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
  @bci=2, line=41 (Interpreted frame)
 - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted 
 frame)
 - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 
 (Interpreted frame)
 - 
 java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
  @bci=95, line=1145 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
 Thread 22431: (state = IN_JAVA)
 - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, 
 line=224 (Compiled frame; information may be imprecise)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() 
 @bci=1, line=38 (Interpreted frame)
 - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, 
 line=198 (Compiled frame)
 - 
 org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, 
 scala.Function2) @bci=201, line=145 (Compiled frame)
 - 
 org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
  scala.Function2) @bci=3, line=32 (Compiled frame)
 - 
 org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
  @bci=141, line=205 (Compiled frame)
 - 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
  @bci=74, line=58 (Interpreted frame)
 - 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
  @bci=169, line=68 (Interpreted frame)
 - 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
  @bci=2, line=41 (Interpreted frame)
 - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted 
 frame)
 - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 
 (Interpreted frame)
 - 
 

[jira] [Commented] (SPARK-5412) Cannot bind Master to a specific hostname as per the documentation

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545371#comment-14545371
 ] 

Sean Owen commented on SPARK-5412:
--

A-ha. I think the issue is that additional args to {{start-master.sh}} aren't 
passed through to {{Master}} with $@. I think they are intended to be, as the 
same thing is done in {{start-slave.sh}} for example. Let me look a little more 
and open a PR if it seems like the right thing to do.

 Cannot bind Master to a specific hostname as per the documentation
 --

 Key: SPARK-5412
 URL: https://issues.apache.org/jira/browse/SPARK-5412
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.2.0
Reporter: Alexis Seigneurin

 Documentation on http://spark.apache.org/docs/latest/spark-standalone.html 
 indicates:
 {quote}
 You can start a standalone master server by executing:
 ./sbin/start-master.sh
 ...
 the following configuration options can be passed to the master and worker:
 ...
 -h HOST, --host HOST  Hostname to listen on
 {quote}
 The \-h or --host parameter actually doesn't work with the 
 start-master.sh script. Instead, one has to set the SPARK_MASTER_IP 
 variable prior to executing the script.
 Either the script or the documentation should be updated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.

2015-05-15 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-7664:
--
Summary: DAG visualization: Fix incorrect link paths of DAG.  (was: Fix 
incorrect link paths of DAG.)

 DAG visualization: Fix incorrect link paths of DAG.
 ---

 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 In JobPage, we can jump a StagePage when we click corresponding box of DAG 
 viz but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7631) treenode argString should not print children

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7631:
-
Priority: Minor  (was: Major)

 treenode argString should not print children
 

 Key: SPARK-7631
 URL: https://issues.apache.org/jira/browse/SPARK-7631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang
Priority: Minor

 spark-sql explain extended
   select * from (
     select key from src union all
     select key from src) t;
 the spark plan will print children in argString

 == Physical Plan ==
 Union[ HiveTableScan [key#1], (MetastoreRelation default, src, None), None,
  HiveTableScan [key#3], (MetastoreRelation default, src, None), None]
  HiveTableScan [key#1], (MetastoreRelation default, src, None), None
  HiveTableScan [key#3], (MetastoreRelation default, src, None), None



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4

2015-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-7536:
---
Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 


 Audit MLlib Python API for 1.4
 --

 Key: SPARK-7536
 URL: https://issues.apache.org/jira/browse/SPARK-7536
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions. We need to track:
 * Inconsistency: Do class/method/parameter names match?
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc. SPARK-7666
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release. SPARK-7665
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python.
 ** classification
 *** StreamingLogisticRegressionWithSGD SPARK-7633
 ** clustering
 *** GaussianMixture SPARK-6258
 *** LDA SPARK-6259
 *** Power Iteration Clustering SPARK-5962
 *** StreamingKMeans SPARK-4118 
 ** evaluation
 *** MultilabelMetrics SPARK-6094 
 ** feature
 *** ElementwiseProduct SPARK-7605
 *** PCA SPARK-7604
 ** linalg
 *** Distributed linear algebra SPARK-6100
 ** pmml.export SPARK-7638
 ** regression
 *** StreamingLinearRegressionWithSGD SPARK-4127
 ** stat
 *** KernelDensity SPARK-7639
 ** util
 *** MLUtils SPARK-6263 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7666) MLlib Python doc parity check

2015-05-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-7666:
--

 Summary: MLlib Python doc parity check
 Key: SPARK-7666
 URL: https://issues.apache.org/jira/browse/SPARK-7666
 Project: Spark
  Issue Type: Documentation
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang


Check then make the MLlib Python doc to be as complete as the Scala doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6399:
-
Priority: Minor  (was: Major)

 Code compiled against 1.3.0 may not run against older Spark versions
 

 Key: SPARK-6399
 URL: https://issues.apache.org/jira/browse/SPARK-6399
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin
Priority: Minor

 Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
 easier to use. The problem is that scalac now generates code that will not 
 run on older Spark versions if those conversions are used.
 Basically, even if you explicitly import {{SparkContext._}}, scalac will 
 generate references to the new methods in the {{RDD}} object instead. So the 
 compiled code will reference code that doesn't exist in older versions of 
 Spark.
 You can work around this by explicitly calling the methods in the 
 {{SparkContext}} object, although that's a little ugly.
 We should at least document this limitation (if there's no way to fix it), 
 since I believe forwards compatibility in the API was also a goal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6355:
-
Priority: Minor  (was: Major)

 Spark standalone cluster does not support local:/ url for jar file
 --

 Key: SPARK-6355
 URL: https://issues.apache.org/jira/browse/SPARK-6355
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1, 1.3.0
Reporter: Jesper Lundgren
Priority: Minor

 Submitting a new spark application to a standalone cluster with local:/path 
 will result in an exception.
 Driver successfully submitted as driver-20150316171157-0004
 ... waiting before polling master for driver state
 ... polling master for driver state
 State of driver-20150316171157-0004 is ERROR
 Exception from cluster was: java.io.IOException: No FileSystem for scheme: 
 local
 java.io.IOException: No FileSystem for scheme: local
   at 
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141)
   at 
 org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545356#comment-14545356
 ] 

Sean Owen commented on SPARK-6415:
--

Sort of related to https://issues.apache.org/jira/browse/SPARK-4545

You don't really want the whole streaming system to stop if one batch fails 
though, right? I can see wanting to stop it if every batch will fail, though 
that's harder to know.

 Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills 
 the app
 -

 Key: SPARK-6415
 URL: https://issues.apache.org/jira/browse/SPARK-6415
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Of course, this would have to be done as a configurable param, but such a 
 fail-fast is useful else it is painful to figure out what is happening when 
 there are cascading failures. In some cases, the SparkContext shuts down and 
 streaming keeps scheduling jobs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6415:
-
Issue Type: Improvement  (was: Bug)

 Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills 
 the app
 -

 Key: SPARK-6415
 URL: https://issues.apache.org/jira/browse/SPARK-6415
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Hari Shreedharan

 Of course, this would have to be done as a configurable param, but such a 
 fail-fast is useful else it is painful to figure out what is happening when 
 there are cascading failures. In some cases, the SparkContext shuts down and 
 streaming keeps scheduling jobs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6035) Unable to launch spark stream driver in cluster mode

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6035.
--
Resolution: Not A Problem

This looks like a problem specific to your setup on EC2. Something failed to 
start up.

 Unable to launch spark stream driver in cluster mode
 

 Key: SPARK-6035
 URL: https://issues.apache.org/jira/browse/SPARK-6035
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.1
Reporter: pankaj

 Spark version: spark-1.2.1-bin-hadoop2.4
 Getting an error while launching the driver from the master node in cluster 
 mode; the driver gets launched on one of the slave nodes in the cluster.
 Issue: when launching from the master, the driver is launched on one of the 
 slaves and tries to connect to the master on port 0. In this case I am 
 launching it from the master, but even if I launch it from some other slave 
 node, it tries to connect to the node from which I launched it.
 Exception
 2015-02-26 07:36:05 INFO  SecurityManager:59 - Changing view acls to: root
 2015-02-26 07:36:05 INFO  SecurityManager:59 - Changing modify acls to: root
 2015-02-26 07:36:05 INFO  SecurityManager:59 - SecurityManager: 
 authentication disabled; ui acls disabled; users with view permissions: 
 Set(root); users with modify permissions: Set(root)
 2015-02-26 07:36:05 DEBUG AkkaUtils:63 - In createActorSystem, requireCookie 
 is: off
 2015-02-26 07:36:06 INFO  Slf4jLogger:80 - Slf4jLogger started
 2015-02-26 07:36:06 ERROR NettyTransport:65 - failed to bind to ec-node 
 -where i run submit command-.compute-1.amazonaws.com/xx.xx.xx.xx:0, shutting 
 down Netty transport
 2015-02-26 07:36:06 ERROR Remoting:65 - Remoting error: [Startup failed] [
 akka.remote.RemoteTransportException: Startup failed
 at 
 akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:136)
 at akka.remote.Remoting.start(Remoting.scala:201)
 at 
 akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
 at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
 at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
 at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
 at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
 at 
 org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
 at 
 org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
 Thanks
 Pankaj



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6533:
-
Labels:   (was: backport-needed)

 Allow using wildcard and other file pattern in Parquet DataSource
 -

 Key: SPARK-6533
 URL: https://issues.apache.org/jira/browse/SPARK-6533
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang
Priority: Critical

 By default, spark.sql.parquet.useDataSourceApi is set to true, and loading 
 Parquet files using a file pattern throws errors.
 *\*Wildcard*
 {noformat}
 scala> val qp = 
 sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded.
 java.io.FileNotFoundException: File does not exist: 
 hdfs://.../source=live/date=2014-06-0*
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
   at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
 {noformat}
 And
 *\[abc\]*
 {noformat}
 val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
 java.lang.IllegalArgumentException: Illegal character in path at index 74: 
 hdfs://.../source=live/date=2014-06-0[12]
   at java.net.URI.create(URI.java:859)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
   ... 49 elided
 Caused by: java.net.URISyntaxException: Illegal character in path at index 
 74: hdfs://.../source=live/date=2014-06-0[12]
   at java.net.URI$Parser.fail(URI.java:2829)
   at java.net.URI$Parser.checkChars(URI.java:3002)
   at java.net.URI$Parser.parseHierarchical(URI.java:3086)
   at java.net.URI$Parser.parse(URI.java:3034)
   at java.net.URI.init(URI.java:595)
   at java.net.URI.create(URI.java:857)
 {noformat}
 If spark.sql.parquet.useDataSourceApi is not enabled we cannot have partition 
 discovery, schema evolution, etc., but being able to specify a file pattern is 
 also very important to applications.
 Please add this important feature.
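 A possible interim workaround is sketched below. It assumes a spark-shell-style session 
 (sc and sqlContext in scope), Spark 1.3's {{SQLContext.parquetFile(paths: String*)}} and the 
 Hadoop FileSystem glob API; the path is the same placeholder as in the report above.
 {code}
 // Sketch only: expand the glob with the Hadoop FileSystem API, then pass the
 // concrete paths to parquetFile.
 import org.apache.hadoop.fs.{FileSystem, Path}

 val fs = FileSystem.get(sc.hadoopConfiguration)
 val matched = fs.globStatus(new Path("hdfs://.../source=live/date=2014-06-0*"))
   .map(_.getPath.toString)

 val qp = sqlContext.parquetFile(matched: _*)
 {code}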
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7476) Dynamic partitioning random behaviour

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7476.
--
Resolution: Invalid

I think this is at best a question for user@. I don't think this relates to 
dynamic partition discovery if that's what you mean, nor is it random.

 Dynamic partitioning random behaviour
 -

 Key: SPARK-7476
 URL: https://issues.apache.org/jira/browse/SPARK-7476
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: Spark SQL in standalone mode on CDH 5.3
Reporter: Eswara Reddy Adapa

 According to the documentation, Spark SQL 1.2 supports dynamic partitioning.
 But we see the following:
 Expected output - 
 !http://ibin.co/20zI4242Ur1h!
 Output seen - 
 !http://ibin.co/20zILW5LQ5nT!
 It generates only one partition (on a random value each time).
 Query:
 USE miah_ga;
 SET hive.exec.dynamic.partition=true;
 SET hive.exec.dynamic.partition.mode = nonstrict;
 DROP TABLE sem_stg_tmp_part;
 CREATE TABLE sem_stg_tmp_part
 (chnl_nm string
 ,cmpgn_yr_nbr string
 ,cmpgn_qtr_nbr string
 ,actv_mo_nm string
 ,actv_wk_end_dt string
 ,actv_dt string
 ,seg_nm string
 ,bdgt_node_nm string
 ,ad_grp_nm string
 ,kywrd_txt string
 ,srch_engn_nm string
 ,engn_cmpgn_nm string
 ,publ_ctry_nm string
 ,publ_geo_cd string
 ,last_dest_url_txt string
 ,dvc_cat_nm string
 ,mdia_propty_nm string
 ,mdia_type_nm string
 ,audnc_nm string
 ,sub_aud_nm string
 ,prog_nm string
 ,org_intv_nm string
 ,sub_org_intv_nm string
 ,prd_nm string
 ,prim_engg_dest_nm string
 ,imprsn_cnt string
 ,click_cnt string
 ,tot_cost_amt string
 ,ctr_pct string
 ,cpc_amt string
 ,vist_cnt string
 ,paid_per_vist_cnt string
 ,paid_pir_vist_cnt string
 ,cost_per_intel_per_vist_amt string)
 partitioned by (ddate string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE;
 INSERT overwrite TABLE sem_stg_tmp_part PARTITION (ddate)
 SELECT *, concat(substr(actv_dt,1,2),substr(actv_dt,4,2),substr(actv_dt,7,4)) 
 as ddate
 FROM miah_ga.sem_stg_tmp;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7665) MLlib Python API breaking changes check between 1.3 & 1.4

2015-05-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-7665:
--

 Summary: MLlib Python API breaking changes check between 1.3 & 1.4
 Key: SPARK-7665
 URL: https://issues.apache.org/jira/browse/SPARK-7665
 Project: Spark
  Issue Type: Documentation
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang


Compare the MLlib Python APIs between 1.3 and 1.4, so we can note breaking changes.
We'll need to note those changes (if any) in the user guide's Migration Guide 
section.
If an API change is for an Alpha/Experimental/DeveloperApi component, we should 
note that as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7063) Update lz4 for Java 7 to avoid: when lz4 compression is used, it causes core dump

2015-05-15 Thread Tim Ellison (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545368#comment-14545368
 ] 

Tim Ellison commented on SPARK-7063:


I can confirm that this failure is no longer seen using LZ4 1.3.0 with IBM Java 
7+.

 Update lz4 for Java 7 to avoid: when lz4 compression is used, it causes core 
 dump
 -

 Key: SPARK-7063
 URL: https://issues.apache.org/jira/browse/SPARK-7063
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
 Environment: IBM JDK
Reporter: Jenny MA
Priority: Minor

 This issue was initially noticed when using the IBM JDK. Below is the stack 
 trace, caused by violating the JNI critical-section rules.
 #0 0x00314340f3cb in raise () from 
 /service/pmrs/45638/20/lib64/libpthread.so.0
 #1 0x7f795b0323be in j9dump_create () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9prt27.so
 #2 0x7f795a88ba2a in doSystemDump () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so
 #3 0x7f795b0405d5 in j9sig_protect () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9prt27.so
 #4 0x7f795a88a1fd in runDumpFunction () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so
 #5 0x7f795a88dbab in runDumpAgent () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so
 #6 0x7f795a8a1c49 in triggerDumpAgents () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so
 #7 0x7f795a4518fe in doTracePoint () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9trc27.so
 #8 0x7f795a45210e in j9Trace () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9trc27.so
 #9 0x7f79590e46e1 in 
 MM_StandardAccessBarrier::jniReleasePrimitiveArrayCritical(J9VMThread*, 
 _jarray*, void*, int) ()
 from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9gc27.so
 #10 0x7f7938bc397c in 
 Java_net_jpountz_lz4_LZ4JNI_LZ4_1compress_1limitedOutput () from 
 /service/pmrs/45638/20/tmp/liblz4-java7155003924599399415.so
 #11 0x7f795b707149 in VMprJavaSendNative () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9vm27.so
 #12 0x in ?? ()
 This is an issue introduced by a bug in net.jpountz.lz4:lz4 1.2.0 and fixed in 
 version 1.3.0. The Sun JDK / OpenJDK does not complain about it, but it triggers 
 an assertion failure when the IBM JDK is used. Here is the link to the fix: 
 https://github.com/jpountz/lz4-java/commit/07229aa2f788229ab4f50379308297f428e3d2d2
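 For applications that hit this crash before their Spark distribution picks up the fix, 
 one way to force the fixed artifact onto the classpath is sketched below for an sbt build 
 (the coordinates are the ones named above; adjust for Maven as needed):
 {code}
 // build.sbt sketch: override the transitive lz4 dependency with the fixed 1.3.0 release.
 dependencyOverrides += "net.jpountz.lz4" % "lz4" % "1.3.0"
 {code}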
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545305#comment-14545305
 ] 

Apache Spark commented on SPARK-7664:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/6184

 DAG visualization: Fix incorrect link paths of DAG.
 ---

 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 In JobPage, we can jump to a StagePage by clicking the corresponding box of the 
 DAG viz, but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7664:
---

Assignee: (was: Apache Spark)

 DAG visualization: Fix incorrect link paths of DAG.
 ---

 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 In JobPage, we can jump to a StagePage by clicking the corresponding box of the 
 DAG viz, but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7664:
---

Assignee: Apache Spark

 DAG visualization: Fix incorrect link paths of DAG.
 ---

 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Assignee: Apache Spark
Priority: Minor

 In JobPage, we can jump to a StagePage by clicking the corresponding box of the 
 DAG viz, but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6973) The total stages on the allJobsPage is wrong

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6973:
-
Component/s: (was: Spark Core)
 Web UI

 The total stages on the allJobsPage is wrong
 

 Key: SPARK-6973
 URL: https://issues.apache.org/jira/browse/SPARK-6973
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: meiyoula
Priority: Minor
 Attachments: allJobs.png


 The job has two stages, a map stage and a collect stage. Both were retried twice.
 The first and second attempts of the map stage were successful, and the third was 
 skipped. For the collect stage, the first and second attempts failed, and the third 
 was successful.
 On the allJobs page, the number of total stages is allStages - skippedStages. 
 Mostly that's right, but here I think the total should be 2.
 The example:
 Stage 0: Map Stage
 Stage 1: Collect Stage
 Stage:  Stage 0 - Stage 1 - Stage 0 (retry 1) - Stage 1 (retry 1) - Stage 0 (retry 2) - Stage 1 (retry 2)
 Status: Success - Fail    - Success           - Fail              - Skipped           - Success
 Though one attempt of Stage 0 is skipped, it was actually executed, so I think it 
 should be included in the total number.
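 A paraphrase of the reporter's expectation in code (illustrative only; the names below 
 are hypothetical and this is not the actual AllJobsPage implementation):
 {code}
 // Illustrative sketch, not Spark's UI code: count every distinct stage of the job,
 // even when one of its attempts was marked as skipped.
 case class StageAttempt(stageId: Int, status: String)

 val attempts = Seq(
   StageAttempt(0, "Success"), StageAttempt(1, "Fail"),
   StageAttempt(0, "Success"), StageAttempt(1, "Fail"),
   StageAttempt(0, "Skipped"), StageAttempt(1, "Success"))

 val totalStages = attempts.map(_.stageId).distinct.size  // 2, as the reporter expects
 {code}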



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7603) Crash of thrift server when doing SQL without limit

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7603:
-
Component/s: (was: Spark Core)
 SQL

 Crash of thrift server when doing SQL without limit
 -

 Key: SPARK-7603
 URL: https://issues.apache.org/jira/browse/SPARK-7603
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Hortonworks Sandbox 2.1  with Spark 1.3.1
Reporter: Ihor Bobak

 I have 2 tables in Hive: one with 120 thousand records, the other one 5 times 
 smaller.
 I'm running a standalone cluster on a single VM, and the thrift server is started 
 with the
 ./start-thriftserver.sh --conf spark.executor.memory=2048m --conf 
 spark.driver.memory=1024m
 command.
 My spark-defaults.conf contains:
 spark.master             spark://sandbox.hortonworks.com:7077
 spark.eventLog.enabled   true
 spark.eventLog.dir       hdfs://sandbox.hortonworks.com:8020/user/pdi/spark/logs
 So, when I run the SQL
 select some fields from header, some fields from details
 from  
   vw_salesorderdetail as d 
   left join vw_salesorderheader as h on h.SalesOrderID = d.SalesOrderID 
 limit 20;
 everything is fine, even though the limit is not really needed (the result set 
 returned is just 12 records).
 But if I run the same query without the limit clause, execution hangs - see here: 
 http://postimg.org/image/fujdjd16f/42945a78/
 - and the thrift server logs contain a lot of exceptions:
 15/05/13 17:59:27 INFO TaskSetManager: Starting task 158.0 in stage 48.0 (TID 
 953, sandbox.hortonworks.com, PROCESS_LOCAL, 1473 bytes)
 15/05/13 18:00:01 INFO TaskSetManager: Finished task 150.0 in stage 48.0 (TID 
 945) in 36166 ms on sandbox.hortonworks.com (152/200)
 15/05/13 18:00:02 ERROR Utils: Uncaught exception in thread Spark Context 
 Cleaner
 java.lang.OutOfMemoryError: GC overhead limit exceeded
   at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:147)
   at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
   at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:143)
   at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
 Exception in thread Spark Context Cleaner 15/05/13 18:00:02 ERROR Utils: 
 Uncaught exception in thread task-result-getter-1
 java.lang.OutOfMemoryError: GC overhead limit exceeded
   at java.lang.String.init(String.java:315)
   at com.esotericsoftware.kryo.io.Input.readAscii(Input.java:562)
   at com.esotericsoftware.kryo.io.Input.readString(Input.java:436)
   at 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
   at 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
   at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:706)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
   at 

[jira] [Updated] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7042:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

I think this is really an Akka / Scala problem, but we can keep this open to track 
updating Akka at some point.

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Priority: Minor

 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like the akka-actor_2.11 2.3.4-spark artifact used by Spark was built 
 with Scala compiler 2.11.0, which ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom akka library used by Spark with a more recent version of 
 the Scala compiler (e.g. 2.11.6)
 - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
 - update the version of akka used by Spark (master and the 1.3 branch)
 I would also suggest upgrading to the latest version of akka, 2.3.9 (or 
 2.3.10, which should be released soon).
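 A quick way to confirm which serialVersionUID a given akka build carries is sketched 
 below (run in a Scala REPL or spark-shell with the akka jar in question on the classpath; 
 the values in the comment are the ones reported in the error above):
 {code}
 // Sketch: print the serialVersionUID of akka.actor.Identify for the akka jar on the classpath.
 import java.io.ObjectStreamClass

 val uid = ObjectStreamClass.lookup(classOf[akka.actor.Identify]).getSerialVersionUID
 println(uid)  // -213377755528332889 for the custom 2.3.4-spark build, 1 for the standard akka 2.3.x build
 {code}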



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6973) The total stages on the allJobsPage is wrong

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6973:
-
Priority: Minor  (was: Major)

 The total stages on the allJobsPage is wrong
 

 Key: SPARK-6973
 URL: https://issues.apache.org/jira/browse/SPARK-6973
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: meiyoula
Priority: Minor
 Attachments: allJobs.png


 The job has two stages, a map stage and a collect stage. Both were retried twice.
 The first and second attempts of the map stage were successful, and the third was 
 skipped. For the collect stage, the first and second attempts failed, and the third 
 was successful.
 On the allJobs page, the number of total stages is allStages - skippedStages. 
 Mostly that's right, but here I think the total should be 2.
 The example:
 Stage 0: Map Stage
 Stage 1: Collect Stage
 Stage:  Stage 0 - Stage 1 - Stage 0 (retry 1) - Stage 1 (retry 1) - Stage 0 (retry 2) - Stage 1 (retry 2)
 Status: Success - Fail    - Success           - Fail              - Skipped           - Success
 Though one attempt of Stage 0 is skipped, it was actually executed, so I think it 
 should be included in the total number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545362#comment-14545362
 ] 

Sean Owen commented on SPARK-6056:
--

I can't make out whether this is an issue or not. Do you just need to allow for 
more off-heap memory in YARN?

 Unlimit offHeap memory use cause RM killing the container
 -

 Key: SPARK-6056
 URL: https://issues.apache.org/jira/browse/SPARK-6056
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.1
Reporter: SaintBacchus

 No matter whether we set `preferDirectBufs` or limit the number of threads, 
 Spark cannot limit the use of off-heap memory.
 At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final, 
 Netty allocates an off-heap buffer of the same size as the heap buffer.
 So however many buffers you want to transfer, the same amount of off-heap memory 
 will be allocated.
 But once the allocated memory reaches the overhead memory capacity set in YARN, 
 the executor will be killed.
 I wrote a simple piece of code to test it:
 {code:title=test.scala|borderStyle=solid}
 import org.apache.spark.storage._
 import org.apache.spark._

 val bufferRdd = sc.makeRDD(0 to 10, 10).map(x => new Array[Byte](10*1024*1024)).persist
 bufferRdd.count
 val part = bufferRdd.partitions(0)
 val sparkEnv = SparkEnv.get
 val blockMgr = sparkEnv.blockManager

 def test = {
   val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index))
   val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]]
   val len = resultIt.map(_.length).sum
   println(s"[${Thread.currentThread.getId}] get block length = $len")
 }

 def test_driver(count: Int, parallel: Int)(f: => Unit) = {
   val tpool = new scala.concurrent.forkjoin.ForkJoinPool(parallel)
   val taskSupport = new scala.collection.parallel.ForkJoinTaskSupport(tpool)
   val parseq = (1 to count).par
   parseq.tasksupport = taskSupport
   parseq.foreach(x => f)
   tpool.shutdown
   tpool.awaitTermination(100, java.util.concurrent.TimeUnit.SECONDS)
 }
 {code}
 Steps to reproduce:
 1. bin/spark-shell --master yarn-client --executor-cores 40 --num-executors 1
 2. :load test.scala in spark-shell
 3. use the following command to watch the executor process on the slave node
 {code}
 pid=$(jps|grep CoarseGrainedExecutorBackend |awk '{print $1}');top -b -p $pid|grep $pid
 {code}
 4. run test_driver(20,100)(test) in spark-shell
 5. watch the output of the command on the slave node
 If multiple threads are used to get len, the physical memory soon exceeds the 
 limit set by spark.yarn.executor.memoryOverhead.
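 Until the underlying allocation pattern is bounded, one mitigation (a sketch only, not a 
 fix; the value is illustrative) is to give the executors more headroom via the 
 configuration key the reporter already names:
 {code}
 // Sketch (value is illustrative): request more YARN overhead room so Netty's direct
 // buffers have headroom. Equivalent to passing
 //   --conf spark.yarn.executor.memoryOverhead=2048
 // to spark-shell / spark-submit. This raises the ceiling; it does not bound the allocation.
 import org.apache.spark.SparkConf

 val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "2048")
 {code}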



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7503) Resources in .sparkStaging directory can't be cleaned up on error

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7503:
-
Assignee: Kousuke Saruta

 Resources in .sparkStaging directory can't be cleaned up on error
 -

 Key: SPARK-7503
 URL: https://issues.apache.org/jira/browse/SPARK-7503
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
 Fix For: 1.4.0


 When we run applications on YARN in cluster mode, uploaded resources in the 
 .sparkStaging directory can't be cleaned up if uploading the local resources 
 fails.
 You can see this issue by running the following command.
 {code}
 bin/spark-submit --master yarn --deploy-mode cluster --class someClassName 
 non-existing-jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7503) Resources in .sparkStaging directory can't be cleaned up on error

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7503.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6026
[https://github.com/apache/spark/pull/6026]

 Resources in .sparkStaging directory can't be cleaned up on error
 -

 Key: SPARK-7503
 URL: https://issues.apache.org/jira/browse/SPARK-7503
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Fix For: 1.4.0


 When we run applications on YARN in cluster mode, uploaded resources in the 
 .sparkStaging directory can't be cleaned up if uploading the local resources 
 fails.
 You can see this issue by running the following command.
 {code}
 bin/spark-submit --master yarn --deploy-mode cluster --class someClassName 
 non-existing-jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7664) Fix incorrect link paths of DAG.

2015-05-15 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-7664:
-

 Summary: Fix incorrect link paths of DAG.
 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor


In JobPage, we can jump to a StagePage by clicking the corresponding box of the DAG viz, 
but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7344) Spark hangs reading and writing to the same S3 bucket

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545323#comment-14545323
 ] 

Sean Owen commented on SPARK-7344:
--

Yes, but the most recent script still runs with Hadoop 1.x code I think, and 
that has old S3 libs. I think this is an S3 client library problem; at the 
least, I would try a build with Hadoop 2.x bindings and later jets3t libs first.

 Spark hangs reading and writing to the same S3 bucket
 -

 Key: SPARK-7344
 URL: https://issues.apache.org/jira/browse/SPARK-7344
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
 Environment: AWS EC2
Reporter: Daniel Mahler

 The following code will hang if the `outprefix` is in an S3 bucket:
 def copy1 = "s3n://mybucket/copy1"
 def copy2 = "s3n://mybucket/copy2"
 val txt1 = sc.textFile(inpath)
 txt1.count
 val res = txt1.saveAsTextFile(copy1)
 val txt2 = sc.textFile(copy1 + "/part-*")
 txt2.count
 txt2.saveAsTextFile(copy2) // - HANGS HERE
 val txt3 = sc.textFile(copy2 + "/part-*")
 txt3.count
 The problem goes away if copy1 and copy2 are in distinct S3 buckets or when 
 using HDFS instead of S3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4

2015-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-7536:
---
Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match? SPARK-7667
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 


 Audit MLlib Python API for 1.4
 --

 Key: SPARK-7536
 URL: https://issues.apache.org/jira/browse/SPARK-7536
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions. We need to track:
 * Inconsistency: Do class/method/parameter names match? SPARK-7667
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc. SPARK-7666
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release. SPARK-7665
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python.
 ** classification
 *** StreamingLogisticRegressionWithSGD SPARK-7633
 ** clustering
 *** GaussianMixture SPARK-6258
 *** LDA SPARK-6259
 *** Power Iteration Clustering SPARK-5962
 *** StreamingKMeans SPARK-4118 
 ** evaluation
 *** MultilabelMetrics SPARK-6094 
 ** feature
 *** ElementwiseProduct SPARK-7605
 *** PCA SPARK-7604
 ** linalg
 *** Distributed linear algebra SPARK-6100
 ** pmml.export SPARK-7638
 ** regression
 *** StreamingLinearRegressionWithSGD SPARK-4127
 ** stat
 *** KernelDensity SPARK-7639
 ** util
 *** MLUtils SPARK-6263 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6527) sc.binaryFiles can not access files on s3

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6527:
-
Component/s: EC2
   Priority: Minor  (was: Major)

Is there any more detail on this, like stack traces or the code you're running?

 sc.binaryFiles can not access files on s3
 -

 Key: SPARK-6527
 URL: https://issues.apache.org/jira/browse/SPARK-6527
 Project: Spark
  Issue Type: Bug
  Components: EC2, Input/Output
Affects Versions: 1.2.0, 1.3.0
 Environment: I am running Spark on EC2
Reporter: Zhao Zhang
Priority: Minor

 sc.binaryFiles() cannot access files stored on S3. It correctly lists the 
 number of files, but reports that the files do not exist when processing 
 them. I also tried sc.textFile(), which works fine.
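 A minimal repro sketch of the reported behaviour (bucket and prefix below are 
 placeholders, not from the report):
 {code}
 // Sketch only: listing the files succeeds, but reading their contents reportedly fails on S3.
 val binary = sc.binaryFiles("s3n://some-bucket/some-prefix/*")
 println(binary.count())                      // listing the files works
 binary.map(_._2.toArray().length).collect()  // reported to fail with "file does not exist"

 val text = sc.textFile("s3n://some-bucket/some-prefix/*")
 println(text.count())                        // reported to work fine
 {code}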



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-15 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545375#comment-14545375
 ] 

Konstantin Shaposhnikov commented on SPARK-7042:


There is nothing wrong with the standard Akka 2.11 build. In fact we have a 
custom build of Spark now that uses standard Akka 2.3.9 from maven central 
repository without any problems.

The error appears only with the custom build of akka (because it was compiled 
with a buggy version of Scala) that comes with Spark by default.

I agree that the number of users affected by this problem is probably quite small 
(only 1? ;)

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Priority: Minor

 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like the akka-actor_2.11 2.3.4-spark artifact used by Spark was built 
 with Scala compiler 2.11.0, which ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom akka library used by Spark with a more recent version of 
 the Scala compiler (e.g. 2.11.6)
 - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
 - update the version of akka used by Spark (master and the 1.3 branch)
 I would also suggest upgrading to the latest version of akka, 2.3.9 (or 
 2.3.10, which should be released soon).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5271) PySpark History Web UI issues

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5271.
--
Resolution: Not A Problem

 PySpark History Web UI issues
 -

 Key: SPARK-5271
 URL: https://issues.apache.org/jira/browse/SPARK-5271
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Web UI
Affects Versions: 1.2.0
 Environment: PySpark 1.2.0 in yarn-client mode
Reporter: Andrey Zimovnov

 After a successful run of a PySpark app via spark-submit in yarn-client mode on a 
 Hadoop 2.4 cluster, the History UI shows the same problem as in issue SPARK-3898.
 {code}
 App Name:Not Started Started:1970/01/01 07:59:59 Spark User:Not Started
 Last Updated:2014/10/10 14:50:39
 Exception message:
 2014-10-10 14:51:14,284 - ERROR - 
 org.apache.spark.Logging$class.logError(Logging.scala:96) - 
 qtp1594785497-16851 -Exception in parsing Spark event log 
 hdfs://wscluster/sparklogs/24.3g_15_5g_2c-1412923684977/EVENT_LOG_1
 org.json4s.package$MappingException: Did not find value which can be 
 converted into int
 at org.json4s.reflect.package$.fail(package.scala:96)
 at org.json4s.Extraction$.convert(Extraction.scala:554)
 at org.json4s.Extraction$.extract(Extraction.scala:331)
 at org.json4s.Extraction$.extract(Extraction.scala:42)
 at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
 at 
 org.apache.spark.util.JsonProtocol$.blockManagerIdFromJson(JsonProtocol.scala:647)
 at 
 org.apache.spark.util.JsonProtocol$.blockManagerAddedFromJson(JsonProtocol.scala:468)
 at 
 org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:404)
 at 
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
 at 
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
 at 
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
 at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
 at 
 org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
 at 
 org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$loadAppInfo(FsHistoryProvider.scala:181)
 at 
 org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:99)
 at 
 org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:55)
 at 
 org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:53)
 at 
 com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
 at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
 at 
 com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
 at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
 at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
 at 
 com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
 org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:88)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
 at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
 at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:370)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at 
 

[jira] [Resolved] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5265.
--
Resolution: Duplicate

I think you described the same issue twice here; please close the old one if 
you're elaborating elsewhere.

 Submitting applications on Standalone cluster controlled by Zookeeper forces 
 to know active master
 --

 Key: SPARK-5265
 URL: https://issues.apache.org/jira/browse/SPARK-5265
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Roque Vassal'lo
  Labels: cluster, spark-submit, standalone, zookeeper

 Hi, this is my first JIRA here, so I hope it is clear enough.
 I'm using Spark 1.2.0 and trying to submit an application on a Spark 
 Standalone cluster in cluster deploy mode with supervise.
 The Standalone cluster is running in high-availability mode, using Zookeeper to 
 provide leader election among three available Masters (named master1, 
 master2 and master3).
 As described in Spark's documentation, to register a Worker with the Standalone 
 cluster I provide the complete cluster info as the spark URL,
 i.e. spark://master1:7077,master2:7077,master3:7077,
 and that URL is parsed into three connection attempts: the first to 
 master1:7077, the second to master2:7077 and the third to master3:7077.
 This works great!
 But if I try to do the same while submitting applications, it fails.
 I mean, if I provide the complete cluster info as the --master option to the 
 spark-submit script, it throws an exception because it tries to connect to it 
 as if it were a single node.
 Example:
 spark-submit --class org.apache.spark.examples.SparkPi --master 
 spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster 
 --supervise examples.jar 100
 This is the output I got:
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mytest); users 
 with modify permissions: Set(mytest)
 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started
 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on 
 port 53930.
 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
 akka.actor.ActorInitializationException: exception during creation
   at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
   at akka.actor.ActorCell.create(ActorCell.scala:596)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
   at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
   at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
   at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
   at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
   at akka.actor.ActorCell.create(ActorCell.scala:580)
   ... 9 more
 Shouldn't it be parsed the same way as on Worker registration?
 That would not force the client to know which Master of the Standalone cluster 
 is currently active.
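 An illustrative sketch of the parsing the reporter expects (not Spark's actual code; 
 the helper name is hypothetical):
 {code}
 // Sketch: split a comma-separated standalone URL into one spark:// URL per Master,
 // the way Worker registration already handles it.
 def parseMasterUrls(sparkUrl: String): Seq[String] = {
   require(sparkUrl.startsWith("spark://"), s"Invalid master URL: $sparkUrl")
   sparkUrl.stripPrefix("spark://").split(",").map("spark://" + _).toSeq
 }

 // parseMasterUrls("spark://master1:7077,master2:7077,master3:7077")
 // => Seq("spark://master1:7077", "spark://master2:7077", "spark://master3:7077")
 {code}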



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5241) spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5241.
--
Resolution: Invalid

I don't understand the problem being reported here. Reopen if you can suggest a 
particular change. Maybe start by asking on user@ ?

 spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) 
 dependencies correctly
 --

 Key: SPARK-5241
 URL: https://issues.apache.org/jira/browse/SPARK-5241
 Project: Spark
  Issue Type: Bug
  Components: Build, EC2
Reporter: Florian Verhein

 spark-ec2/spark/init.sh doesn't completely respect the Hadoop dependencies. 
 This may also be an issue for the Tachyon dependencies. Related: Tachyon appears 
 to require builds against the right version of Hadoop as well (probably causes 
 this: SPARK-3185).
 This applies to the Spark build from a git checkout in spark/init.sh (I suspect 
 this should also be changed to use mvn, as that's the reference build according 
 to the docs?).
 It may apply to pre-built Spark in spark/init.sh as well, but I'm not sure about 
 this. E.g. I thought that the hadoop2.4 and cdh4.2 builds of Spark are 
 different.
 Also note that hadoop native is built from Hadoop 2.4.1 on the AMI, and this 
 is used regardless of HADOOP_MAJOR_VERSION in the *-hdfs modules.
 Tachyon is hard-coded to 0.4.1 (which is probably built against hadoop1.x?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545464#comment-14545464
 ] 

Sean Owen commented on SPARK-4808:
--

I think this is considered resolved now for 1.4 after 
https://github.com/apache/spark/commit/3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099 
? but not 1.3.
Maybe [~andrewor14] can confirm.

 Spark fails to spill with small number of large objects
 ---

 Key: SPARK-4808
 URL: https://issues.apache.org/jira/browse/SPARK-4808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler

 Spillable's maybeSpill does not allow a spill to occur until at least 1000 
 elements have been read, and then only evaluates whether to spill every 32nd 
 element thereafter. When a small number of very large items is being 
 tracked, out-of-memory conditions may occur.
 I suspect that this and the every-32nd-element behavior were meant to reduce the 
 impact of the estimateSize() call. That method was extracted into 
 SizeTracker, which implements its own exponential backoff for size estimation, 
 so now we are only avoiding using the resulting estimated size.
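 A paraphrase of the gating described above in code (illustrative only; this is not the 
 actual Spillable source and the names are hypothetical):
 {code}
 // Illustrative sketch of the described gating, not Spark's Spillable implementation.
 def shouldEvaluateSpill(elementsRead: Long): Boolean =
   elementsRead > 1000 && elementsRead % 32 == 0

 def maybeSpill(elementsRead: Long, currentMemory: Long, memoryThreshold: Long): Boolean =
   shouldEvaluateSpill(elementsRead) && currentMemory >= memoryThreshold

 // With only a handful of very large objects, elementsRead may never pass these gates,
 // so the size estimate is never acted on and the JVM can run out of memory first.
 {code}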



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4560) Lambda deserialization error

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4560.
--
Resolution: Not A Problem

 Lambda deserialization error
 

 Key: SPARK-4560
 URL: https://issues.apache.org/jira/browse/SPARK-4560
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.1.1
 Environment: Java 8.0.25
Reporter: Alexis Seigneurin
 Attachments: IndexTweets.java, pom.xml


 I'm getting an error saying a lambda could not be deserialized. Here is the 
 code:
 {code}
 TwitterUtils.createStream(sc, twitterAuth, filters)
     .map(t -> t.getText())
     .foreachRDD(tweets -> {
         tweets.foreach(x -> System.out.println(x));
         return null;
     });
 {code}
 Here is the exception:
 {noformat}
 java.io.IOException: unexpected exception type
   at 
 java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
   at 
 java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
   ... 27 more
 Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
   at 
 com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
   ... 37 more
 {noformat}
 The weird thing is, if I write the following code (with the map operation 
 inside the foreachRDD), it works without a problem.
 {code}
 TwitterUtils.createStream(sc, twitterAuth, filters)
     .foreachRDD(tweets -> {
         tweets.map(t -> t.getText())
               .foreach(x -> System.out.println(x));
         return null;
     });
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (SPARK-4556) binary distribution assembly can't run in local mode

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4556:
---

Assignee: Apache Spark

 binary distribution assembly can't run in local mode
 

 Key: SPARK-4556
 URL: https://issues.apache.org/jira/browse/SPARK-4556
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Reporter: Sean Busbey
Assignee: Apache Spark

 After building the binary distribution assembly, the resultant tarball can't 
 be used for local mode.
 {code}
 busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
 [INFO] Scanning for projects...
 ...SNIP...
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
 s]
 [INFO] Spark Project Networking ... SUCCESS [ 31.402 
 s]
 [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
 s]
 [INFO] Spark Project Core . SUCCESS [15:39 
 min]
 [INFO] Spark Project Bagel  SUCCESS [ 29.470 
 s]
 [INFO] Spark Project GraphX ... SUCCESS [05:20 
 min]
 [INFO] Spark Project Streaming  SUCCESS [11:02 
 min]
 [INFO] Spark Project Catalyst . SUCCESS [11:26 
 min]
 [INFO] Spark Project SQL .. SUCCESS [11:33 
 min]
 [INFO] Spark Project ML Library ... SUCCESS [14:27 
 min]
 [INFO] Spark Project Tools  SUCCESS [ 40.980 
 s]
 [INFO] Spark Project Hive . SUCCESS [11:45 
 min]
 [INFO] Spark Project REPL . SUCCESS [03:15 
 min]
 [INFO] Spark Project Assembly . SUCCESS [04:22 
 min]
 [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
 s]
 [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
 s]
 [INFO] Spark Project External Flume ... SUCCESS [01:41 
 min]
 [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
 s]
 [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
 s]
 [INFO] Spark Project External Kafka ... SUCCESS [01:23 
 min]
 [INFO] Spark Project Examples . SUCCESS [10:19 
 min]
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 01:47 h
 [INFO] Finished at: 2014-11-22T02:13:51-06:00
 [INFO] Final Memory: 79M/2759M
 [INFO] 
 
 busbey2-MBA:spark busbey$ cd assembly/target/
 busbey2-MBA:target busbey$ mkdir dist-temp
 busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
 spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
 busbey2-MBA:target busbey$ cd dist-temp/
 busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
 ls: 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
  No such file or directory
 Failed to find Spark assembly in 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
 You need to build Spark before running this program.
 {code}
 It looks like the classpath calculations in {{bin/compute-classpath.sh}} 
 don't handle it.
 If I move all of the spark-*.jar files from the top level into the lib folder 
 and touch the RELEASE file, then the spark shell launches in local mode 
 normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3602) Can't run cassandra_inputformat.py

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3602.
--
Resolution: Not A Problem

I think this is due to mismatching Hadoop libs, or at least is stale enough at 
this point that I think it should be closed.

 Can't run cassandra_inputformat.py
 --

 Key: SPARK-3602
 URL: https://issues.apache.org/jira/browse/SPARK-3602
 Project: Spark
  Issue Type: Bug
  Components: Examples, PySpark
Affects Versions: 1.1.0
 Environment: Ubuntu 14.04
Reporter: Frens Jan Rumph

 When I execute:
 {noformat}
 wget 
 http://apache.cs.uu.nl/dist/spark/spark-1.1.0/spark-1.1.0-bin-hadoop2.4.tgz
 tar xzf spark-1.1.0-bin-hadoop2.4.tgz
 cd spark-1.1.0-bin-hadoop2.4/
 ./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop2.4.0.jar 
 examples/src/main/python/cassandra_inputformat.py localhost keyspace cf
 {noformat}
 The output is:
 {noformat}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/09/19 10:41:10 WARN Utils: Your hostname, laptop-x resolves to a 
 loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0)
 14/09/19 10:41:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/09/19 10:41:10 INFO SecurityManager: Changing view acls to: frens-jan,
 14/09/19 10:41:10 INFO SecurityManager: Changing modify acls to: frens-jan,
 14/09/19 10:41:10 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); 
 users with modify permissions: Set(frens-jan, )
 14/09/19 10:41:11 INFO Slf4jLogger: Slf4jLogger started
 14/09/19 10:41:11 INFO Remoting: Starting remoting
 14/09/19 10:41:11 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@laptop-x.local:43790]
 14/09/19 10:41:11 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkDriver@laptop-x.local:43790]
 14/09/19 10:41:11 INFO Utils: Successfully started service 'sparkDriver' on 
 port 43790.
 14/09/19 10:41:11 INFO SparkEnv: Registering MapOutputTracker
 14/09/19 10:41:11 INFO SparkEnv: Registering BlockManagerMaster
 14/09/19 10:41:11 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140919104111-145e
 14/09/19 10:41:11 INFO Utils: Successfully started service 'Connection 
 manager for block manager' on port 45408.
 14/09/19 10:41:11 INFO ConnectionManager: Bound socket to port 45408 with id 
 = ConnectionManagerId(laptop-x.local,45408)
 14/09/19 10:41:11 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
 14/09/19 10:41:11 INFO BlockManagerMaster: Trying to register BlockManager
 14/09/19 10:41:11 INFO BlockManagerMasterActor: Registering block manager 
 laptop-x.local:45408 with 265.4 MB RAM
 14/09/19 10:41:11 INFO BlockManagerMaster: Registered BlockManager
 14/09/19 10:41:11 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-5f0289d7-9b20-4bd7-a713-db84c38c4eac
 14/09/19 10:41:11 INFO HttpServer: Starting HTTP Server
 14/09/19 10:41:11 INFO Utils: Successfully started service 'HTTP file server' 
 on port 36556.
 14/09/19 10:41:11 INFO Utils: Successfully started service 'SparkUI' on port 
 4040.
 14/09/19 10:41:11 INFO SparkUI: Started SparkUI at 
 http://laptop-frens-jan.local:4040
 14/09/19 10:41:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/09/19 10:41:12 INFO SparkContext: Added JAR 
 file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/lib/spark-examples-1.1.0-hadoop2.4.0.jar
  at http://192.168.2.2:36556/jars/spark-examples-1.1.0-hadoop2.4.0.jar with 
 timestamp 146072417
 14/09/19 10:41:12 INFO Utils: Copying 
 /home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py
  to /tmp/spark-7dbb1b4d-016c-4f8b-858d-f79c9297f58f/cassandra_inputformat.py
 14/09/19 10:41:12 INFO SparkContext: Added file 
 file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py
  at http://192.168.2.2:36556/files/cassandra_inputformat.py with timestamp 
 146072419
 14/09/19 10:41:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
 akka.tcp://sparkDriver@laptop-frens-jan.local:43790/user/HeartbeatReceiver
 14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with 
 curMem=0, maxMem=278302556
 14/09/19 10:41:12 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 163.7 KB, free 265.3 MB)
 14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with 
 curMem=167659, maxMem=278302556
 14/09/19 10:41:12 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory 

[jira] [Commented] (SPARK-2445) MesosExecutorBackend crashes in fine grained mode

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545504#comment-14545504
 ] 

Sean Owen commented on SPARK-2445:
--

[~gbow...@fastmail.co.uk] are you saying that SPARK-3535 did actually resolve 
this?

 MesosExecutorBackend crashes in fine grained mode
 -

 Key: SPARK-2445
 URL: https://issues.apache.org/jira/browse/SPARK-2445
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Dario Rexin

 When multiple instances of the MesosExecutorBackend are running on the same 
 slave, they will have the same executorId assigned (equal to the mesos 
 slaveId), but will have a different port (which is randomly assigned). 
 Because of this, it can not register a new BlockManager, because one is 
 already registered with the same executorId, but a different BlockManagerId. 
 More description and a fix can be found in this PR on GitHub:
 https://github.com/apache/spark/pull/1358
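A minimal sketch of the collision described above, using simplified stand-in types rather than the real Spark/Mesos classes (the slave id, hostname and ports are made up for illustration):
{code}
// Stand-in types for illustration only; not the real Spark classes.
case class BlockManagerId(executorId: String, host: String, port: Int)

object BlockManagerRegistry {
  private val byExecutor = scala.collection.mutable.Map.empty[String, BlockManagerId]

  // Registration keyed by executorId: a second block manager with the same
  // executorId but a different port is rejected, mirroring the failure above.
  def register(id: BlockManagerId): Boolean = byExecutor.get(id.executorId) match {
    case Some(existing) if existing != id => false
    case _ => byExecutor(id.executorId) = id; true
  }
}

object FineGrainedCollisionDemo extends App {
  val slaveId = "mesos-slave-01" // both fine-grained executors reuse the Mesos slave id
  println(BlockManagerRegistry.register(BlockManagerId(slaveId, "host-a", 38000))) // true
  println(BlockManagerRegistry.register(BlockManagerId(slaveId, "host-a", 41237))) // false: collision
}
{code}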






[jira] [Resolved] (SPARK-1928) DAGScheduler suspended by local task OOM

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1928.
--
   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Peng Zhen

Resolved long ago by https://github.com/apache/spark/pull/883

 DAGScheduler suspended by local task OOM
 

 Key: SPARK-1928
 URL: https://issues.apache.org/jira/browse/SPARK-1928
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 0.9.0
Reporter: Peng Zhen
Assignee: Peng Zhen
 Fix For: 1.1.0


 DAGScheduler does not handle local task OOM properly, and will wait for the 
 job result forever.
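A hypothetical local-mode reproduction sketch of the hang described above (the allocation size, app name, and a small driver heap such as -Xmx64m are assumptions, not taken from the report):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object LocalOomRepro extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("oom-repro"))
  // Each task tries to allocate a huge array, so the locally-run task throws
  // java.lang.OutOfMemoryError; per the report, the DAGScheduler then waits forever.
  sc.parallelize(1 to 4).map(_ => new Array[Long](Int.MaxValue / 2)).count()
}
{code}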






[jira] [Resolved] (SPARK-2133) FileNotFoundException in BlockObjectWriter

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2133.
--
Resolution: Cannot Reproduce

 FileNotFoundException in BlockObjectWriter
 --

 Key: SPARK-2133
 URL: https://issues.apache.org/jira/browse/SPARK-2133
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: YARN
Reporter: Neville Li

 Seeing a lot of this when running ALS on large data sets (> 50GB) and YARN. 
 The job eventually fails after spark.task.maxFailures has been reached.
 {code}
 Exception in thread Thread-3 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1.0:677 failed 10 times, most recent failure: Exception failure in TID 
 946 on host lon4-hadoopslave-b501.lon4.spotify.net: 
 java.io.FileNotFoundException: 
 /disk/hd01/yarn/local/usercache/neville/appcache/application_1401944843353_36952/spark-local-20140611033053-6b18/2a/shuffle_0_677_985
  (No such file or directory)
 java.io.FileOutputStream.openAppend(Native Method)
 java.io.FileOutputStream.init(FileOutputStream.java:192)
 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
 
 org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
 
 org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}






[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-15 Thread Guillaume E.B. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545533#comment-14545533
 ] 

Guillaume E.B. commented on SPARK-4105:
---

I think I hit this bug while using another compression codec. I will try to reproduce 
it as soon as I can.
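For reference, a sketch of switching the block compression codec via configuration (the codec names are the standard Spark ones; whether a different codec actually avoids the corruption is exactly what needs verifying):
{code}
import org.apache.spark.SparkConf

// Re-run the same job with a different compression codec to check whether
// FAILED_TO_UNCOMPRESS is specific to snappy.
val conf = new SparkConf()
  .setAppName("codec-check")
  .set("spark.io.compression.codec", "lzf") // alternatives: "snappy" (default), "lz4"
{code}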

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Attachments: JavaObjectToSerialize.java, 
 SparkFailedToUncompressGenerator.scala


 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Here's another occurrence of a similar error:
 

[jira] [Commented] (SPARK-5220) keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545456#comment-14545456
 ] 

Sean Owen commented on SPARK-5220:
--

[~superxma] is this resolved then?

 keepPushingBlocks in BlockGenerator terminated when an exception occurs, 
 which causes the block pushing thread to terminate and blocks receiver  
 -

 Key: SPARK-5220
 URL: https://issues.apache.org/jira/browse/SPARK-5220
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Max Xu

 I am running a Spark streaming application with ReliableKafkaReceiver. It 
 uses BlockGenerator to push blocks to BlockManager. However, writing WALs to 
 HDFS may time out, which causes keepPushingBlocks in BlockGenerator to 
 terminate.
 15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at 
 org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:176)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:160)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:126)
 at 
 org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124)
 at 
 org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:207)
 at 
 org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:275)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:181)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:154)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:86)
 The block pushing thread then terminates and no subsequent blocks can be pushed 
 into BlockManager, which in turn prevents the receiver from receiving new data. 
 So when the TimeoutException happens while my app is running, the 
 ReliableKafkaReceiver stays in ACTIVE status but doesn't do anything at all; 
 the application hangs.
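One possible mitigation, sketched here as an illustration rather than the project's actual fix, is to catch failures inside the push loop and retry instead of letting the pushing thread die:
{code}
import scala.util.control.NonFatal

object BlockPushRetry {
  // Illustrative retry wrapper; pushBlock stands in for the real block-push call.
  def pushWithRetry(pushBlock: () => Unit, maxAttempts: Int = 3): Unit = {
    var attempt = 0
    var done = false
    while (!done && attempt < maxAttempts) {
      attempt += 1
      try { pushBlock(); done = true }
      catch { case NonFatal(e) =>
        // Log and retry rather than terminating the block pushing thread outright.
        System.err.println(s"Block push failed (attempt $attempt): $e")
      }
    }
    if (!done) throw new RuntimeException(s"Giving up after $maxAttempts attempts")
  }
}
{code}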






[jira] [Assigned] (SPARK-5175) bug in updating counters when starting multiple workers/supervisors in actor-based receiver

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5175:
---

Assignee: (was: Apache Spark)

 bug in updating counters when starting multiple workers/supervisors in 
 actor-based receiver
 ---

 Key: SPARK-5175
 URL: https://issues.apache.org/jira/browse/SPARK-5175
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Nan Zhu

 When starting multiple workers (ActorReceiver.scala), the counters in it are 
 not updated.






[jira] [Assigned] (SPARK-5174) Missing Document for starting multiple workers/supervisors in actor-based receiver

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5174:
---

Assignee: (was: Apache Spark)

 Missing Document for starting multiple workers/supervisors in actor-based 
 receiver
 --

 Key: SPARK-5174
 URL: https://issues.apache.org/jira/browse/SPARK-5174
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Nan Zhu
Priority: Minor

 Currently, the documentation for starting multiple supervisors/workers is 
 missing, though the implementation provides this capability:
 {code:title=ActorReceiver.scala|borderStyle=solid}
 case props: Props =>
   val worker = context.actorOf(props)
   logInfo("Started receiver worker at:" + worker.path)
   sender ! worker

 case (props: Props, name: String) =>
   val worker = context.actorOf(props, name)
   logInfo("Started receiver worker at:" + worker.path)
   sender ! worker

 case _: PossiblyHarmful => hiccups.incrementAndGet()

 case _: Statistics =>
   val workers = context.children
   sender ! Statistics(n.get, workers.size, hiccups.get, workers.mkString("\n"))
 {code}
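A hedged usage sketch of the capability shown above (the `supervisor` reference and the worker actor class are illustrative assumptions; pinning down the real entry point is exactly what the missing documentation should do):
{code}
import akka.actor.{Actor, Props}

// Hypothetical user-defined worker actor that would feed data to the receiver.
class MyWorker extends Actor {
  def receive = { case _ => () }
}

// From code that holds a reference to the receiver's supervisor actor:
//   supervisor ! Props[MyWorker]                // start an anonymous worker
//   supervisor ! (Props[MyWorker], "worker-1")  // start a named worker
{code}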






[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4

2015-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-7536:
---
Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 


 Audit MLlib Python API for 1.4
 --

 Key: SPARK-7536
 URL: https://issues.apache.org/jira/browse/SPARK-7536
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions.  We need to track:
 * Inconsistency: Do class/method/parameter names match?
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc.
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release. SPARK-7665
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python.
 ** classification
 *** StreamingLogisticRegressionWithSGD SPARK-7633
 ** clustering
 *** GaussianMixture SPARK-6258
 *** LDA SPARK-6259
 *** Power Iteration Clustering SPARK-5962
 *** StreamingKMeans SPARK-4118 
 ** evaluation
 *** MultilabelMetrics SPARK-6094 
 ** feature
 *** ElementwiseProduct SPARK-7605
 *** PCA SPARK-7604
 ** linalg
 *** Distributed linear algebra SPARK-6100
 ** pmml.export SPARK-7638
 ** regression
 *** StreamingLinearRegressionWithSGD SPARK-4127
 ** stat
 *** KernelDensity SPARK-7639
 ** util
 *** MLUtils SPARK-6263 






[jira] [Created] (SPARK-7667) MLlib Python API consistency check

2015-05-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-7667:
--

 Summary: MLlib Python API consistency check
 Key: SPARK-7667
 URL: https://issues.apache.org/jira/browse/SPARK-7667
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang


Check and ensure that the MLlib Python API (classes/methods/parameters) is 
consistent with Scala.






[jira] [Updated] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4598:
-
Issue Type: Improvement  (was: Bug)

 Paginate stage page to avoid OOM with > 100,000 tasks
 -

 Key: SPARK-4598
 URL: https://issues.apache.org/jira/browse/SPARK-4598
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0
Reporter: meiyoula

 On the HistoryServer stage page, clicking the task link in the Description 
 column triggers a GC error. The detailed error message is:
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-352] | Error for 
 /history/application_1416206401491_0010/stages/stage/ | 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-364] | handle failed | 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
 java.lang.OutOfMemoryError: GC overhead limit exceeded






[jira] [Updated] (SPARK-1910) Add onBlockComplete API to receiver

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1910:
-
Issue Type: Improvement  (was: Bug)

 Add onBlockComplete API to receiver
 ---

 Key: SPARK-1910
 URL: https://issues.apache.org/jira/browse/SPARK-1910
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Reporter: Hari Shreedharan

 This would allow the receiver to ACK all data that has already been 
 successfully stored by the block generator. It means the receiver's store 
 methods must now receive the block id, so the receiver can recognize which 
 events have been stored.
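A rough sketch of what such a callback could look like (names and signatures here are hypothetical and only illustrate the proposal; they are not an existing Spark API):
{code}
// Hypothetical extension of the receiver API described above.
trait BlockCompletionListener {
  /** Called once the block generator has durably stored the block with this id. */
  def onBlockComplete(blockId: String): Unit
}

class AckingReceiverPart extends BlockCompletionListener {
  def onBlockComplete(blockId: String): Unit = {
    // ACK upstream (e.g. to Flume or Kafka) only the events that belong to blockId.
  }
}
{code}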






[jira] [Updated] (SPARK-1107) Add shutdown hook on executor stop to stop running tasks

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1107:
-
Issue Type: Improvement  (was: Bug)

We have a shutdown hook that stops the SparkContext, which is kind of related.

 Add shutdown hook on executor stop to stop running tasks
 

 Key: SPARK-1107
 URL: https://issues.apache.org/jira/browse/SPARK-1107
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Andrew Ash

 Originally reported by aash:
 http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA%2B-p3AHXYhpjXH9fr8jQ5%2B_gc%3DNHjLbOiJB9bHSahfEET5aHBQ%40mail.gmail.com%3E
 Latest in thread:
 http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA+-p3AFi7vz=2oty3caa0g+5ekg+a84uvqrl9tgstvgwgyb...@mail.gmail.com%3E
 The most popular approach is to add a shutdown hook that stops running tasks 
 in the executors.
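A minimal sketch of the JVM-level mechanism involved; stopRunningTasks() is a placeholder rather than an existing Spark method, and deciding what it should actually do is the open design question:
{code}
object ExecutorShutdown {
  // Placeholder for whatever "stop running tasks" should do, e.g. interrupt
  // task threads and give them a moment to clean up.
  def stopRunningTasks(): Unit = ()

  // Register a JVM shutdown hook that runs the placeholder on executor stop.
  def install(): Unit =
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = stopRunningTasks()
    }))
}
{code}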






[jira] [Resolved] (SPARK-604) reconnect if mesos slaves dies

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-604.
-
Resolution: Cannot Reproduce

Stale at this point, without similar findings recently.

 reconnect if mesos slaves dies
 --

 Key: SPARK-604
 URL: https://issues.apache.org/jira/browse/SPARK-604
 Project: Spark
  Issue Type: Bug
  Components: Mesos

 when running on mesos, if a slave goes down, spark doesn't try to reassign 
 the work to another machine.  Even if the slave comes back up, the job is 
 doomed.
 Currently when this happens, we just see this in the driver logs:
 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: Mesos slave lost: 
 201210312057-1560611338-5050-24091-52
 Exception in thread Thread-346 java.util.NoSuchElementException: key not 
 found: value: 201210312057-1560611338-5050-24091-52
 at scala.collection.MapLike$class.default(MapLike.scala:224)
 at scala.collection.mutable.HashMap.default(HashMap.scala:43)
 at scala.collection.MapLike$class.apply(MapLike.scala:135)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:43)
 at 
 spark.scheduler.cluster.ClusterScheduler.slaveLost(ClusterScheduler.scala:255)
 at 
 spark.scheduler.mesos.MesosSchedulerBackend.slaveLost(MesosSchedulerBackend.scala:275)
 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: driver.run() returned 
 with code DRIVER_ABORTED
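The immediate crash in the log above is a HashMap.apply on a slave id that is no longer (or was never) registered. A defensive lookup of the kind sketched here, which is only an illustration and not the project's actual change, would at least keep the scheduler thread alive so the work could be reassigned:
{code}
object SlaveLostHandling {
  private val slaveIdToExecutor = scala.collection.mutable.HashMap[String, String]()

  // Illustrative only: use get() instead of apply() so an unknown slave id is
  // logged and ignored rather than throwing NoSuchElementException.
  def slaveLost(slaveId: String): Unit = slaveIdToExecutor.get(slaveId) match {
    case Some(executorId) =>
      // ... clean up state for executorId and reassign its tasks ...
      slaveIdToExecutor -= slaveId
    case None =>
      println(s"Ignoring lost Mesos slave $slaveId: no executor registered")
  }
}
{code}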






[jira] [Commented] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545441#comment-14545441
 ] 

Sean Owen commented on SPARK-5331:
--

[~florianverhein] is this an issue then or just a matter of setting the config 
correctly?
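For reference, a sketch of what setting the config explicitly would look like, assuming the Spark 1.2 spark.tachyonStore.url property and the Tachyon master hostname quoted in the report below:
{code}
import org.apache.spark.SparkConf

// Point executors at the real Tachyon master instead of the default
// tachyon://localhost:19998, which is what the failing workers try below.
val conf = new SparkConf()
  .set("spark.tachyonStore.url",
       "tachyon://ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com:19998")
{code}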

 Spark workers can't find tachyon master as spark-ec2 doesn't set 
 spark.tachyonStore.url
 ---

 Key: SPARK-5331
 URL: https://issues.apache.org/jira/browse/SPARK-5331
 Project: Spark
  Issue Type: Bug
  Components: EC2
 Environment: Running on EC2 via modified spark-ec2 scripts (to get 
 dependencies right so tachyon starts)
 Using tachyon 0.5.0 built against hadoop 2.4.1
 Spark 1.2.0 built against tachyon 0.5.0 and hadoop 2.4.1
 Tachyon configured using the template in 0.5.0 but updated with slave list 
 and master variables etc..
Reporter: Florian Verhein

 ps -ef | grep Tachyon 
 shows Tachyon running on the master (and the slave) node with the correct setting:
 -Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com
 However from stderr log on worker running the SparkTachyonPi example:
 15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
 15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
 15/01/20 06:00:56 ERROR : Failed to connect (1) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:00:57 ERROR : Failed to connect (2) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:00:58 ERROR : Failed to connect (3) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:00:59 ERROR : Failed to connect (4) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:01:00 ERROR : Failed to connect (5) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir 
 null failed
 java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 
 after 5 attempts
   at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
   at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
   at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
   at 
 org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
   at 
 org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
   at 
 org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57)
   at 
 org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
   at 
 org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
   at 
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master 
 localhost/127.0.0.1:19998 after 5 attempts
   at tachyon.master.MasterClient.connect(MasterClient.java:178)
   at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
   ... 28 more
 Caused by: 

[jira] [Resolved] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5246.
--
Resolution: Done
  Assignee: Vladimir Grigor

(This was really fixed by a PR for Mesos.)

 spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does 
 not resolve
 --

 Key: SPARK-5246
 URL: https://issues.apache.org/jira/browse/SPARK-5246
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor
Assignee: Vladimir Grigor

 How to reproduce: 
 1) Following http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
 should be sufficient to set up a VPC for this bug. After you have followed that 
 guide, start a new instance in the VPC and ssh to it (through the NAT server).
 2) user starts a cluster in VPC:
 {code}
 ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
 --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
 Setting up security groups...
 
 (omitted for brevity)
 10.1.1.62
 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
 no org.apache.spark.deploy.master.Master to stop
 starting org.apache.spark.deploy.master.Master, logging to 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 failed to launch org.apache.spark.deploy.master.Master:
   at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
   ... 12 more
 full log in 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
 10.1.1.62:at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
 10.1.1.62:... 12 more
 10.1.1.62: full log in 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
 [timing] spark-standalone setup:  00h 00m 28s
  
 (omitted for brevity)
 {code}
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 {code}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
 :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
  -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
 org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 
 --webui-port 8080
 
 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
 HUP, INT]
 Exception in thread main java.net.UnknownHostException: ip-10-1-1-151: 
 ip-10-1-1-151: Name or service not known
 at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
 at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
 at 
 org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
 at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
 at 
 org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
 at 
 org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
 at 
 org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
 at 
 org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
 at 
 org.apache.spark.deploy.master.MasterArguments.init(MasterArguments.scala:27)
 at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
 at org.apache.spark.deploy.master.Master.main(Master.scala)
 Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
 known
 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
 at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
 at 
 java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
 at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
 ... 12 more
 {code}
 The problem is that an instance launched in a VPC may not be able to resolve 
 its own local hostname. Please see 
 https://forums.aws.amazon.com/thread.jspa?threadID=92092.
 I am going to submit a fix for this problem since I need this functionality 
 asap.





[jira] [Resolved] (SPARK-3942) LogisticRegressionWithLBFGS should not use SquaredL2Updater

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3942.
--
Resolution: Won't Fix

 LogisticRegressionWithLBFGS should not use SquaredL2Updater 
 

 Key: SPARK-3942
 URL: https://issues.apache.org/jira/browse/SPARK-3942
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: fuminglin

 The LBFGS method uses a line search for the step size, but all of MLlib's 
 updaters use a step size that decreases with the square root of the number of 
 iterations, which may cause the Wolfe conditions not to hold.
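For reference, the Wolfe conditions a line search is expected to satisfy, written out in LaTeX-style notation (f is the objective, p_k the search direction, alpha_k the step size, and 0 < c_1 < c_2 < 1):
{code}
f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^{\top} p_k    % sufficient decrease
\nabla f(x_k + \alpha_k p_k)^{\top} p_k \ge c_2 \nabla f(x_k)^{\top} p_k    % curvature
{code}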






[jira] [Resolved] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3967.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Christophe Préaud

 Spark applications fail in yarn-cluster mode when the directories configured 
 in yarn.nodemanager.local-dirs are located on different disks/partitions
 -

 Key: SPARK-3967
 URL: https://issues.apache.org/jira/browse/SPARK-3967
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Christophe Préaud
Assignee: Christophe Préaud
 Fix For: 1.2.0

 Attachments: spark-1.1.0-utils-fetch.patch, 
 spark-1.1.0-yarn_cluster_tmpdir.patch


 Spark applications fail from time to time in yarn-cluster mode (but not in 
 yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
 set to a comma-separated list of directories which are located on different 
 disks/partitions.
 Steps to reproduce:
 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
 directories located on different partitions (the more you set, the more 
 likely it will be to reproduce the bug):
 (...)
 <property>
   <name>yarn.nodemanager.local-dirs</name>
   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
 </property>
 (...)
 2. Launch an application in yarn-cluster mode several times; it will fail 
 (apparently randomly) from time to time.





