[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544987#comment-14544987
 ] 

Josh Rosen commented on SPARK-7660:
---

Note that this affects more than just Spark 1.4.0; I'll trace back and figure 
out the complete list of affected versions tomorrow, but I think that any 
version that relied on a snappy-java release published after mid-June or July 
2014 may be affected.

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice.
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.
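
To make the trigger condition above concrete, here is a minimal sketch (illustrative only, not taken from the Spark test suite) of the double-close pattern on a snappy-java SnappyOutputStream that the upstream report describes:

{code}
import java.io.ByteArrayOutputStream
import org.xerial.snappy.SnappyOutputStream

object DoubleCloseSketch {
  def main(args: Array[String]): Unit = {
    val sink = new ByteArrayOutputStream()
    val snappyOut = new SnappyOutputStream(sink)
    snappyOut.write("some shuffle output".getBytes("UTF-8"))
    snappyOut.close()
    // Second close: in affected snappy-java versions this hands the stream's
    // internal input/output buffers back to the recycling pool a second time,
    // so two SnappyOutputStreams created later can be given the same buffer
    // and silently overwrite each other's data.
    snappyOut.close()
  }
}
{code}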






[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545005#comment-14545005
 ] 

Josh Rosen commented on SPARK-7660:
---

I pushed 
https://github.com/apache/spark/commit/7da33ce5057ff965eec19ce662465b64a3564019 
as a hotfix, which masks the bug in a way that fixes the JavaAPISuite Jenkins 
failures.  We'll still fix this bug before 1.4, but in the meantime the hotfix 
will make it easier to recognize new Jenkins failures.

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.






[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-7660:
-

Assignee: Josh Rosen

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.






[jira] [Commented] (SPARK-7662) Exception of multi-attribute generator analysis in projection

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545072#comment-14545072
 ] 

Apache Spark commented on SPARK-7662:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/6178

 Exception of multi-attribute generator analysis in projection
 

 Key: SPARK-7662
 URL: https://issues.apache.org/jira/browse/SPARK-7662
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 {code}
 select explode(map(value, key)) from src;
 {code}
 It throws exception like
 {panel}
 org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
 AS clause does not match the number of columns output by the UDTF expected 2 
 aliases but got _c0 ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 {panel}
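
For context, a minimal way to reproduce the failure above (a sketch only; it assumes a Hive table {{src(key STRING, value STRING)}} like the one used in Spark SQL's Hive tests, with local-mode settings chosen purely for illustration):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object ExplodeMapSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "explode-map-sketch")
    val hiveContext = new HiveContext(sc)
    // explode(map(...)) is a UDTF that produces two output columns (the map's
    // key and value), but the projection only generates a single alias (_c0),
    // which trips the alias-count check and raises the AnalysisException above.
    hiveContext.sql("select explode(map(value, key)) from src").collect()
  }
}
{code}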






[jira] [Assigned] (SPARK-7662) Exception of multi-attribute generator analysis in projection

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7662:
---

Assignee: Apache Spark

 Exception of multi-attribute generator analysis in projection
 

 Key: SPARK-7662
 URL: https://issues.apache.org/jira/browse/SPARK-7662
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark
Priority: Blocker

 {code}
 select explode(map(value, key)) from src;
 {code}
 It throws exception like
 {panel}
 org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
 AS clause does not match the number of columns output by the UDTF expected 2 
 aliases but got _c0 ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 {panel}






[jira] [Assigned] (SPARK-7662) Exception of multi-attribute generator analysis in projection

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7662:
---

Assignee: (was: Apache Spark)

 Exception of multi-attribute generator analysis in projection
 

 Key: SPARK-7662
 URL: https://issues.apache.org/jira/browse/SPARK-7662
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 {code}
 select explode(map(value, key)) from src;
 {code}
 It throws exception like
 {panel}
 org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
 AS clause does not match the number of columns output by the UDTF expected 2 
 aliases but got _c0 ;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
 {panel}






[jira] [Created] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-7660:
-

 Summary: Snappy-java buffer-sharing bug leads to data corruption / 
test failures
 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker


snappy-java contains a bug that can lead to situations where separate 
SnappyOutputStream instances end up sharing the same input and output buffers, 
which can lead to data corruption issues.  See 
https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and 
https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue.

I discovered this issue because the buffer-sharing was leading to a test 
failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
the wrong answer because both tasks wrote their output using the same 
compression buffers and one task won the race, causing its output to be written 
to both shuffle output files. As a result, the test returned the result of 
collecting one partition twice.

The buffer-sharing can only occur if {{close()}} is called twice on the same 
SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a 
more precise description of when this issue may occur, see my upstream 
tickets).  I think that this double-close happens somewhere in some test code 
that was added as part of my Tungsten shuffle patch, exposing this bug (to see 
this, download a recent build of master and run 
https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
force the test execution order that triggers the bug).

I think that it's rare that this bug would lead to silent failures like this. 
In more realistic workloads that aren't writing only a handful of bytes per 
task, I would expect this issue to lead to stream corruption issues like 
SPARK-4105.






[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib

2015-05-15 Thread Alex (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545027#comment-14545027
 ] 

Alex commented on SPARK-2344:
-

Hi,

How are you? I have a couple of questions:

1) When are you planning to submit the FCM to the main Spark branch? (I'm
interested in working on top of it for Feature Weight FCM improvements.)

2) Is there a way for Spark to distribute an RDD based on input data
columns rather than rows?

Thanks,
Alex


 Add Fuzzy C-Means algorithm to MLlib
 

 Key: SPARK-2344
 URL: https://issues.apache.org/jira/browse/SPARK-2344
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Alex
Priority: Minor
  Labels: clustering
   Original Estimate: 1m
  Remaining Estimate: 1m

 I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib.
 FCM is very similar to K-Means, which is already implemented; they differ 
 only in the degree of relationship each point has with each cluster 
 (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1).
 As part of the implementation I would like to:
 - create a base class for K-Means and FCM
 - implement the relationship for each algorithm differently (in its own class)
 I'd like this to be assigned to me.






[jira] [Commented] (SPARK-6747) Support List as a return type in Hive UDF

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545077#comment-14545077
 ] 

Apache Spark commented on SPARK-6747:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/6179

 Support List as a return type in Hive UDF
 ---

 Key: SPARK-6747
 URL: https://issues.apache.org/jira/browse/SPARK-6747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takeshi Yamamuro

 The current implementation can't handle List as a return type in a Hive UDF.
 Assume a UDF like the one below:
 public class UDFToListString extends UDF {
   public List<String> evaluate(Object o) {
     return Arrays.asList("xxx", "yyy", "zzz");
   }
 }
 An exception of scala.MatchError is thrown as follows when the UDF is used:
 scala.MatchError: interface java.util.List (of class java.lang.Class)
   at 
 org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
   at 
 org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
   at 
 org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
   at 
 org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
   at 
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
   at 
 scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
 ...
 To fix this problem, we need to add an entry for List in 
 HiveInspectors#javaClassToDataType.
 However, this has one difficulty because of type erasure in the JVM.
 Assume the lines below are appended to HiveInspectors#javaClassToDataType:
 // list type
 case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
   val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
   println(tpe.getActualTypeArguments()(0).toString()) // prints 'E'
 This logic fails to recover the component type of the List.
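
A tiny standalone sketch of the erasure problem (illustrative only; {{ErasureSketch}} is not part of Spark):

{code}
import java.lang.reflect.ParameterizedType

object ErasureSketch {
  def main(args: Array[String]): Unit = {
    val c: Class[_] = classOf[java.util.List[_]]
    // java.util.List extends Collection<E>, so the only generic type argument
    // visible on the Class at runtime is the unresolved type variable E;
    // the concrete element type (e.g. String) has been erased.
    val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
    println(tpe.getActualTypeArguments()(0)) // prints "E"
  }
}
{code}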






[jira] [Updated] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7660:
--
Description: 
snappy-java contains a bug that can lead to situations where separate 
SnappyOutputStream instances end up sharing the same input and output buffers, 
which can lead to data corruption issues.  See 
https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and 
https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue.

I discovered this issue because the buffer-sharing was leading to a test 
failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
the wrong answer because both tasks wrote their output using the same 
compression buffers and one task won the race, causing its output to be written 
to both shuffle output files. As a result, the test returned the result of 
collecting one partition twice (see 
https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
details).

The buffer-sharing can only occur if {{close()}} is called twice on the same 
SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a 
more precise description of when this issue may occur, see my upstream 
tickets).  I think that this double-close happens somewhere in some test code 
that was added as part of my Tungsten shuffle patch, exposing this bug (to see 
this, download a recent build of master and run 
https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
force the test execution order that triggers the bug).

I think that it's rare that this bug would lead to silent failures like this. 
In more realistic workloads that aren't writing only a handful of bytes per 
task, I would expect this issue to lead to stream corruption issues like 
SPARK-4105.

  was:
snappy-java contains a bug that can lead to situations where separate 
SnappyOutputStream instances end up sharing the same input and output buffers, 
which can lead to data corruption issues.  See 
https://github.com/xerial/snappy-java/issues/107 for my upstream bug report and 
https://github.com/xerial/snappy-java/pull/108 for my patch to fix this issue.

I discovered this issue because the buffer-sharing was leading to a test 
failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
the wrong answer because both tasks wrote their output using the same 
compression buffers and one task won the race, causing its output to be written 
to both shuffle output files. As a result, the test returned the result of 
collecting one partition twice.

The buffer-sharing can only occur if {{close()}} is called twice on the same 
SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for a 
more precise description of when this issue may occur, see my upstream 
tickets).  I think that this double-close happens somewhere in some test code 
that was added as part of my Tungsten shuffle patch, exposing this bug (to see 
this, download a recent build of master and run 
https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
force the test execution order that triggers the bug).

I think that it's rare that this bug would lead to silent failures like this. 
In more realistic workloads that aren't writing only a handful of bytes per 
task, I would expect this issue to lead to stream corruption issues like 
SPARK-4105.


 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream tickets).

[jira] [Updated] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6258:
-
Fix Version/s: 1.4.0

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
 Fix For: 1.4.0


 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Resolved] (SPARK-7591) FSBasedRelation interface tweaks

2015-05-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-7591.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6150
[https://github.com/apache/spark/pull/6150]

 FSBasedRelation interface tweaks
 

 Key: SPARK-7591
 URL: https://issues.apache.org/jira/browse/SPARK-7591
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.4.0


 # Renaming {{FSBasedRelation}} to {{HadoopFsRelation}}
   Since it's all coupled with the Hadoop {{FileSystem}} and job APIs.
 # {{HadoopFsRelation}} should have a no-arg constructor
   {{paths}} and {{partitionColumns}} should just be methods to be overridden, 
 rather than constructor arguments. This makes data source developers' lives 
 easier by having a no-arg constructor and being serialization friendly.
 # Renaming {{HadoopFsRelation.prepareForWrite}} to 
 {{HadoopFsRelation.prepareJobForWrite}}
   The new name explicitly suggests developers should only touch the {{Job}} 
 instance for preparation work (which is also documented in Scaladoc).
 # Allowing serialization while creating {{OutputWriter}}s
   To be more precise, {{OutputWriter}}s are never created on driver side and 
 serialized to executor side. But the factory that creates {{OutputWriter}}s 
 should be created on driver side and serialized.
   The reason behind this is that passing all needed materials to 
 {{OutputWriter}} instances via the Hadoop Configuration is doable but sometimes 
 neither intuitive nor convenient. Resorting to serialization makes data 
 source developers' lives easier. This actually came up when I was migrating 
 the Parquet data source and wanted to pass the final output path (instead of 
 the temporary work path) to the output writer (see 
 [here|https://github.com/liancheng/spark/commit/ec9950c591e5b981ce20fab96562db28488e0035#diff-53521d336f7259e859fea4d3ca4dc888R74]).
  There I had to put a property into the Configuration object.






[jira] [Commented] (SPARK-7621) Report KafkaReceiver MessageHandler errors so StreamingListeners can take action

2015-05-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545105#comment-14545105
 ] 

Saisai Shao commented on SPARK-7621:


Hi [~jerluc], you could submit a related PR on GitHub; the Spark community 
submits patches on GitHub rather than attaching them to JIRA.

 Report KafkaReceiver MessageHandler errors so StreamingListeners can take 
 action
 

 Key: SPARK-7621
 URL: https://issues.apache.org/jira/browse/SPARK-7621
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0, 1.3.1
Reporter: Jeremy A. Lucas
 Fix For: 1.3.1

 Attachments: SPARK-7621.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Currently, when a MessageHandler (for any of the Kafka Receiver 
 implementations) encounters an error handling a message, the error is only 
 logged with:
 {code:none}
 case e: Exception => logError("Error handling message", e)
 {code}
 It would be _incredibly_ useful to be able to notify any registered 
 StreamingListener of this receiver error (especially since this 
 {{try...catch}} block masks more fatal Kafka connection exceptions).
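
One possible direction (a sketch only, shown with a hypothetical custom receiver rather than the real KafkaReceiver code) is to surface handler failures through {{Receiver.reportError}}, which is what ultimately reaches {{StreamingListener.onReceiverError}}:

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical stand-in for a Kafka message-handling loop.
class SketchReceiver(messages: Seq[String])
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    new Thread("sketch-receiver") {
      override def run(): Unit = {
        messages.foreach { msg =>
          try {
            store(msg)
          } catch {
            case e: Exception =>
              // Instead of only calling logError: report the failure to the
              // driver, where registered StreamingListeners see onReceiverError.
              reportError("Error handling message", e)
          }
        }
      }
    }.start()
  }

  override def onStop(): Unit = {}
}
{code}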






[jira] [Commented] (SPARK-7269) Incorrect aggregation analysis

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544979#comment-14544979
 ] 

Apache Spark commented on SPARK-7269:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/6173

 Incorrect aggregation analysis
 --

 Key: SPARK-7269
 URL: https://issues.apache.org/jira/browse/SPARK-7269
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 In a case-insensitive analyzer (HiveContext), capitalization differences in 
 attribute names will fail the analysis check for aggregation.
 {code}
 test("check analysis failed in case in-sensitive") {
   Seq(1, 2, 3).map(i => (i, i.toString)).toDF("key", "value")
     .registerTempTable("df_analysis")
   sql("SELECT kEy from df_analysis group by key")
 }
 {code}
 {noformat}
 expression 'kEy' is neither present in the group by, nor is it an aggregate 
 function. Add to group by or wrap in first() if you don't care which value 
 you get.;
 org.apache.spark.sql.AnalysisException: expression 'kEy' is neither present 
 in the group by, nor is it an aggregate function. Add to group by or wrap in 
 first() if you don't care which value you get.;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:39)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:85)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:101)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:89)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:39)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1121)
   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:97)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply$mcV$sp(SQLQuerySuite.scala:408)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 {noformat}






[jira] [Resolved] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-6258.
--
Resolution: Fixed

Issue resolved by pull request 6087
[https://github.com/apache/spark/pull/6087]

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Commented] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545017#comment-14545017
 ] 

Apache Spark commented on SPARK-7654:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6175

 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern for write.
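
As a rough sketch of the intended usage (assuming the builder lands with methods along the lines of format/mode/save; the names here are illustrative):

{code}
import org.apache.spark.sql.{DataFrame, SaveMode}

// Chain the output options on a writer object instead of passing an
// ever-growing list of arguments to save(...). Assumes `df` already exists.
def writeAsParquet(df: DataFrame, path: String): Unit = {
  df.write
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .save(path)
}
{code}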






[jira] [Created] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming

2015-05-15 Thread Murtaza Kanchwala (JIRA)
Murtaza Kanchwala created SPARK-7661:


 Summary: Support for dynamic allocation of executors in Kinesis 
Spark Streaming
 Key: SPARK-7661
 URL: https://issues.apache.org/jira/browse/SPARK-7661
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.3.1
 Environment: AWS-EMR
Reporter: Murtaza Kanchwala


Currently the logic for the number of executors is (N + 1), where N is the 
number of shards in a Kinesis stream.

My requirement is that if I use this resharding util for Amazon Kinesis:

Amazon Kinesis Resharding: 
https://github.com/awslabs/amazon-kinesis-scaling-utils

then there should be some way to allocate executors based on the number of 
shards directly (for Spark Streaming only).






[jira] [Assigned] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7654:
---

Assignee: Apache Spark  (was: Reynold Xin)

 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern for write.






[jira] [Assigned] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7654:
---

Assignee: Reynold Xin  (was: Apache Spark)

 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern for write.






[jira] [Assigned] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7651:
---

Assignee: Apache Spark

 PySpark GMM predict, predictSoft should fail on bad input
 -

 Key: SPARK-7651
 URL: https://issues.apache.org/jira/browse/SPARK-7651
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Minor

 In PySpark, GaussianMixtureModel predict and predictSoft test if the argument 
 is an RDD and operate correctly if so.  But if the argument is not an RDD, 
 they fail silently, returning nothing.
 [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176]
 Instead, they should raise errors.






[jira] [Assigned] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7651:
---

Assignee: (was: Apache Spark)

 PySpark GMM predict, predictSoft should fail on bad input
 -

 Key: SPARK-7651
 URL: https://issues.apache.org/jira/browse/SPARK-7651
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Joseph K. Bradley
Priority: Minor

 In PySpark, GaussianMixtureModel predict and predictSoft test if the argument 
 is an RDD and operate correctly if so.  But if the argument is not an RDD, 
 they fail silently, returning nothing.
 [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176]
 Instead, they should raise errors.






[jira] [Commented] (SPARK-7651) PySpark GMM predict, predictSoft should fail on bad input

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545126#comment-14545126
 ] 

Apache Spark commented on SPARK-7651:
-

User 'FlytxtRnD' has created a pull request for this issue:
https://github.com/apache/spark/pull/6180

 PySpark GMM predict, predictSoft should fail on bad input
 -

 Key: SPARK-7651
 URL: https://issues.apache.org/jira/browse/SPARK-7651
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Joseph K. Bradley
Priority: Minor

 In PySpark, GaussianMixtureModel predict and predictSoft test if the argument 
 is an RDD and operate correctly if so.  But if the argument is not an RDD, 
 they fail silently, returning nothing.
 [https://github.com/apache/spark/blob/11a1a135d1fe892cd48a9116acc7554846aed84c/python/pyspark/mllib/clustering.py#L176]
 Instead, they should raise errors.






[jira] [Assigned] (SPARK-7586) User guide update for spark.ml Word2Vec

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7586:
---

Assignee: Apache Spark  (was: Xusen Yin)

 User guide update for spark.ml Word2Vec
 ---

 Key: SPARK-7586
 URL: https://issues.apache.org/jira/browse/SPARK-7586
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Apache Spark

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.






[jira] [Assigned] (SPARK-7586) User guide update for spark.ml Word2Vec

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7586:
---

Assignee: Xusen Yin  (was: Apache Spark)

 User guide update for spark.ml Word2Vec
 ---

 Key: SPARK-7586
 URL: https://issues.apache.org/jira/browse/SPARK-7586
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Xusen Yin

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.






[jira] [Commented] (SPARK-7586) User guide update for spark.ml Word2Vec

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545153#comment-14545153
 ] 

Apache Spark commented on SPARK-7586:
-

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6181

 User guide update for spark.ml Word2Vec
 ---

 Key: SPARK-7586
 URL: https://issues.apache.org/jira/browse/SPARK-7586
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Xusen Yin

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.






[jira] [Updated] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7663:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Yeah it should be an error in any event. It's just a question of whether you 
want a different error. You could `require` a non-empty iterator with an 
appropriate error message instead.
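
A minimal sketch of that suggestion (the names below are illustrative, not the actual Word2Vec internals):

{code}
// Fail fast with an actionable message instead of letting `.head` throw
// "next on empty iterator" when minCount filters out the entire vocabulary.
def firstVectorSize(vectors: Iterator[Array[Float]], minCount: Int): Int = {
  require(vectors.hasNext,
    s"The vocabulary is empty: minCount ($minCount) may exceed every word's frequency.")
  vectors.next().length
}
{code}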

 [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size 
 is zero
 -

 Key: SPARK-7663
 URL: https://issues.apache.org/jira/browse/SPARK-7663
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
Priority: Minor
 Fix For: 1.4.1


 mllib.feature.Word2Vec at line 442: 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442
  uses `.head` to get the vector size. But it would throw an empty iterator 
 error if the `minCount` is large enough to remove all words in the dataset.
 But since this is not a common scenario, maybe we can ignore it. If so, 
 we can close the issue directly. If not, I can add some code to print a 
 more helpful error message.






[jira] [Commented] (SPARK-7566) HiveContext.analyzer cannot be overriden

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545066#comment-14545066
 ] 

Apache Spark commented on SPARK-7566:
-

User 'smola' has created a pull request for this issue:
https://github.com/apache/spark/pull/6177

 HiveContext.analyzer cannot be overriden
 

 Key: SPARK-7566
 URL: https://issues.apache.org/jira/browse/SPARK-7566
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
Assignee: Santiago M. Mola
 Fix For: 1.4.0


 Trying to override HiveContext.analyzer will give the following compilation 
 error:
 {code}
 Error:(51, 36) overriding lazy value analyzer in class HiveContext of type 
 org.apache.spark.sql.catalyst.analysis.Analyzer{val extendedResolutionRules: 
 List[org.apache.spark.sql.catalyst.rules.Rule[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]]};
  lazy value analyzer has incompatible type
   override protected[sql] lazy val analyzer: Analyzer = {
^
 {code}
 That is because the inferred type changed inadvertently when the explicit 
 declaration of the return type was omitted.
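
The same failure can be reproduced with a small standalone example (hypothetical classes; in HiveContext the inferred type is an anonymous refinement of Analyzer rather than a named subclass, but the mechanism is the same):

{code}
class Base
class WithExtras extends Base { val extendedResolutionRules: List[Int] = Nil }

class Parent {
  // No explicit type annotation, so the member's type is inferred as the more
  // specific WithExtras rather than Base.
  protected lazy val analyzerLike = new WithExtras
}

class Child extends Parent {
  // Uncommenting this fails to compile with "overriding lazy value analyzerLike
  // in class Parent of type WithExtras; lazy value analyzerLike has incompatible type".
  // override protected lazy val analyzerLike: Base = new Base
}
{code}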






[jira] [Created] (SPARK-7662) Exception of multi-attribute generator analysis in projection

2015-05-15 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-7662:


 Summary: Exception of multi-attribute generator analysis in 
projection
 Key: SPARK-7662
 URL: https://issues.apache.org/jira/browse/SPARK-7662
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker


{code}
select explode(map(value, key)) from src;
{code}

It throws exception like
{panel}
org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
AS clause does not match the number of columns output by the UDTF expected 2 
aliases but got _c0 ;
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
{panel}






[jira] [Updated] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7663:
-
Fix Version/s: (was: 1.4.1)

(Don't set Fix Version please: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark )

 [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size 
 is zero
 -

 Key: SPARK-7663
 URL: https://issues.apache.org/jira/browse/SPARK-7663
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
Priority: Minor

 mllib.feature.Word2Vec at line 442: 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442
  uses `.head` to get the vector size. But it would throw an empty iterator 
 error if the `minCount` is large enough to remove all words in the dataset.
 But since this is not a common scenario, maybe we can ignore it. If so, 
 we can close the issue directly. If not, I can add some code to print a 
 more helpful error message.






[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545006#comment-14545006
 ] 

Josh Rosen commented on SPARK-4105:
---

I've opened SPARK-7660 to track progress on the fix for the snappy-java buffer 
sharing bug.

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Attachments: JavaObjectToSerialize.java, 
 SparkFailedToUncompressGenerator.scala


 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Here's another occurrence of a similar error:
 {code}
 

[jira] [Updated] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7654:
---
Summary: DataFrameReader and DataFrameWriter for input/output API  (was: 
Create builder pattern for DataFrame.save)

 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7654) DataFrameReader and DataFrameWriter for input/output API

2015-05-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7654:
---
Description: 
We have a proliferation of save options now. It'd make more sense to have a 
builder pattern for write.




  was:
We have a proliferation of save options now. It'd make more sense to have a 
builder pattern instead.



 DataFrameReader and DataFrameWriter for input/output API
 

 Key: SPARK-7654
 URL: https://issues.apache.org/jira/browse/SPARK-7654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 We have a proliferation of save options now. It'd make more sense to have a 
 builder pattern for write.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7660:
---

Assignee: Apache Spark

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Assignee: Apache Spark
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545031#comment-14545031
 ] 

Apache Spark commented on SPARK-7660:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6176

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7660:
---

Assignee: (was: Apache Spark)

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7660) Snappy-java buffer-sharing bug leads to data corruption / test failures

2015-05-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545014#comment-14545014
 ] 

Josh Rosen commented on SPARK-7660:
---

If we're wary of upgrading to a new Snappy version and don't want to wait for a 
new release / backport, one option is to just wrap SnappyOutputStream with our 
own code to make close() idempotent.  I don't think that this will have any 
significant overhead if done right, since the JIT should be able to inline the 
SnappyOutputStream calls.
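
For illustration, a minimal sketch of such a wrapper, assuming all we need is an 
idempotent close(); class and method names here are illustrative, not the actual 
Spark change:

{code}
import java.io.OutputStream

// Minimal sketch (not the actual Spark patch): delegate everything, but make
// close() a no-op after the first call, so a double-close can never hand the
// underlying Snappy buffers back to the buffer pool twice.
class IdempotentCloseOutputStream(out: OutputStream) extends OutputStream {
  private[this] var closed = false
  override def write(b: Int): Unit = out.write(b)
  override def write(b: Array[Byte], off: Int, len: Int): Unit = out.write(b, off, len)
  override def flush(): Unit = out.flush()
  override def close(): Unit = {
    if (!closed) {
      closed = true
      out.close()
    }
  }
}
{code}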

 Snappy-java buffer-sharing bug leads to data corruption / test failures
 ---

 Key: SPARK-7660
 URL: https://issues.apache.org/jira/browse/SPARK-7660
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.4.0
Reporter: Josh Rosen
Priority: Blocker

 snappy-java contains a bug that can lead to situations where separate 
 SnappyOutputStream instances end up sharing the same input and output 
 buffers, which can lead to data corruption issues.  See 
 https://github.com/xerial/snappy-java/issues/107 for my upstream bug report 
 and https://github.com/xerial/snappy-java/pull/108 for my patch to fix this 
 issue.
 I discovered this issue because the buffer-sharing was leading to a test 
 failure in JavaAPISuite: one of the repartition-and-sort tests was returning 
 the wrong answer because both tasks wrote their output using the same 
 compression buffers and one task won the race, causing its output to be 
 written to both shuffle output files. As a result, the test returned the 
 result of collecting one partition twice (see 
 https://github.com/apache/spark/pull/5868#issuecomment-101954962 for more 
 details).
 The buffer-sharing can only occur if {{close()}} is called twice on the same 
 SnappyOutputStream _and_ the JVM experiences little GC / memory pressure (for 
 a more precise description of when this issue may occur, see my upstream 
 tickets).  I think that this double-close happens somewhere in some test code 
 that was added as part of my Tungsten shuffle patch, exposing this bug (to 
 see this, download a recent build of master and run 
 https://gist.github.com/JoshRosen/eb3257a75c16597d769f locally in order to 
 force the test execution order that triggers the bug).
 I think that it's rare that this bug would lead to silent failures like this. 
 In more realistic workloads that aren't writing only a handful of bytes per 
 task, I would expect this issue to lead to stream corruption issues like 
 SPARK-4105.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero

2015-05-15 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-7663:


 Summary: [MLLIB] feature.Word2Vec throws empty iterator error when 
the vocabulary size is zero
 Key: SPARK-7663
 URL: https://issues.apache.org/jira/browse/SPARK-7663
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
 Fix For: 1.4.1


mllib.feature.Word2Vec at line 442: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442
 uses `.head` to get the vector size, but it throws an empty-iterator error if 
`minCount` is large enough to remove all words in the dataset.

Since this is not a common scenario, maybe we can ignore it; if so, we can close 
this issue directly. If not, I can add some code to print a more helpful error 
message.
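
A minimal sketch of the kind of guard that could replace the bare `.head` call; 
the collection name below is hypothetical and only stands in for whatever 
per-word structure Word2Vec builds after filtering by `minCount`:

{code}
// Hypothetical guard before calling .head: `vectors` stands in for the per-word
// collection that remains after words below minCount have been removed.
def vectorSizeOf(vectors: Map[String, Array[Float]], minCount: Int): Int = {
  require(vectors.nonEmpty,
    s"Word2Vec vocabulary is empty; all words were filtered out by minCount=$minCount")
  vectors.head._2.length
}
{code}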



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7227) Support fillna / dropna in R DataFrame

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7227:
---

Assignee: Apache Spark  (was: Sun Rui)

 Support fillna / dropna in R DataFrame
 --

 Key: SPARK-7227
 URL: https://issues.apache.org/jira/browse/SPARK-7227
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Reynold Xin
Assignee: Apache Spark
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7227) Support fillna / dropna in R DataFrame

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7227:
---

Assignee: Sun Rui  (was: Apache Spark)

 Support fillna / dropna in R DataFrame
 --

 Key: SPARK-7227
 URL: https://issues.apache.org/jira/browse/SPARK-7227
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Reynold Xin
Assignee: Sun Rui
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7227) Support fillna / dropna in R DataFrame

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545300#comment-14545300
 ] 

Apache Spark commented on SPARK-7227:
-

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/6183

 Support fillna / dropna in R DataFrame
 --

 Key: SPARK-7227
 URL: https://issues.apache.org/jira/browse/SPARK-7227
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Reynold Xin
Assignee: Sun Rui
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7657) [YARN] Show driver link in Spark UI

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7657:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

 [YARN] Show driver link in Spark UI
 ---

 Key: SPARK-7657
 URL: https://issues.apache.org/jira/browse/SPARK-7657
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.4.0
Reporter: Hari Shreedharan
Priority: Minor

 Currently, the driver link does not show up in the application UI. It is 
 painful to debug apps running in cluster mode if the link does not show up. 
 Client mode is fine since the links are local to the client machine.
 In YARN mode, it is possible to just get this from the YARN container report. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6499) pyspark: printSchema command on a dataframe hangs

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545355#comment-14545355
 ] 

Sean Owen commented on SPARK-6499:
--

I can't reproduce this. Are you sure it still happens? What version is this on, and does it still happen on master?

 pyspark: printSchema command on a dataframe hangs
 -

 Key: SPARK-6499
 URL: https://issues.apache.org/jira/browse/SPARK-6499
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: cynepia
 Attachments: airports.json, pyspark.txt


 1. A printSchema() call on a DataFrame fails to respond even after a long time.
 Console logs will be attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6399:
-
Issue Type: Improvement  (was: Bug)

 Code compiled against 1.3.0 may not run against older Spark versions
 

 Key: SPARK-6399
 URL: https://issues.apache.org/jira/browse/SPARK-6399
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin

 Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
 easier to use. The problem is that scalac now generates code that will not 
 run on older Spark versions if those conversions are used.
 Basically, even if you explicitly import {{SparkContext._}}, scalac will 
 generate references to the new methods in the {{RDD}} object instead. So the 
 compiled code will reference code that doesn't exist in older versions of 
 Spark.
 You can work around this by explicitly calling the methods in the 
 {{SparkContext}} object, although that's a little ugly.
 We should at least document this limitation (if there's no way to fix it), 
 since I believe forwards compatibility in the API was also a goal.
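 A minimal sketch of that workaround, calling the conversion defined on the 
 {{SparkContext}} companion object explicitly (illustrative only; concrete key 
 and value types are used so the implicit ClassTags resolve):
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.RDD

 // Invoke the conversion on the SparkContext object explicitly, so the compiled
 // bytecode does not reference the RDD-object methods that only exist in 1.3.0+.
 def wordCount(lines: RDD[String]): RDD[(String, Int)] = {
   val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
   SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _)
 }
 {code}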



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6287) Add support for dynamic allocation in the Mesos coarse-grained scheduler

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6287:
-
Issue Type: Improvement  (was: Bug)

 Add support for dynamic allocation in the Mesos coarse-grained scheduler
 

 Key: SPARK-6287
 URL: https://issues.apache.org/jira/browse/SPARK-6287
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Iulian Dragos

 Add support inside the coarse-grained Mesos scheduler for dynamic allocation. 
 It amounts to implementing two methods that allow scaling up and down the 
 number of executors:
 {code}
 def doKillExecutors(executorIds: Seq[String])
 def doRequestTotalExecutors(requestedTotal: Int)
 {code}
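 A hedged sketch of the shape such an implementation could take (illustrative 
 only: the field names are hypothetical, and the Boolean return values follow 
 the corresponding CoarseGrainedSchedulerBackend hooks):
 {code}
 // Illustrative shape only: keep a target executor count and a set of executors
 // marked for removal, and have the two hooks just update that state so later
 // Mesos offers can be accepted or declined accordingly.
 trait DynamicAllocationHooks {
   @volatile protected var executorLimit: Int = Int.MaxValue
   protected val pendingRemoves = scala.collection.mutable.Set.empty[String]

   def doRequestTotalExecutors(requestedTotal: Int): Boolean = {
     executorLimit = requestedTotal
     true
   }

   def doKillExecutors(executorIds: Seq[String]): Boolean = {
     pendingRemoves ++= executorIds
     true
   }
 }
 {code}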



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7336) Sometimes the status of a finished job shown on the JobHistory UI stays active and never updates.

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7336:
-
Priority: Minor  (was: Major)

 Sometimes the status of a finished job shown on the JobHistory UI stays 
 active and never updates.
 

 Key: SPARK-7336
 URL: https://issues.apache.org/jira/browse/SPARK-7336
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: ShaoChuan
Priority: Minor

 When I run a SparkPi job, the status of the job on the JobHistory UI is 
 'active'. Long after the job has finished, the status on the JobHistory UI 
 still never updates, and the job stays in the 'Incomplete applications' list. 
 This problem appears occasionally, and the JobHistory configuration is left at 
 its default values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6520) Kryo serialization broken in the shell

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6520.
--
Resolution: Won't Fix

Yes, I think this is a function of how {{:paste}}d code is evaluated and how 
that interacts with what Kryo expects. I don't know that it's realistic to 
expect that to change; spark-shell is just quite different in how classes are 
defined on the fly. You can run a compiled program instead, or paste your class 
definitions separately first if you have to.
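
For example, a minimal illustration of that workaround, reusing the {{Example}} 
class from the report (illustrative only; {{sc}} is the shell's SparkContext):

{code}
// Paste (or compile) the class definition on its own first:
case class Example(foo: String, bar: String)

// Then, in a separate statement outside that paste block:
val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", "bar2"))).collect()
{code}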

 Kryo serialization broken in the shell
 --

 Key: SPARK-6520
 URL: https://issues.apache.org/jira/browse/SPARK-6520
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0
Reporter: Aaron Defazio

 If I start spark as follows:
 {quote}
 ~/spark-1.3.0-bin-hadoop2.4/bin/spark-shell --master local[1] --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer
 {quote}
 Then using :paste, run 
 {quote}
 case class Example(foo : String, bar : String)
 val ex = sc.parallelize(List(Example("foo1", "bar1"), Example("foo2", 
 "bar2"))).collect()
 {quote}
 I get the error:
 {quote}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
 (TID 0, localhost): java.io.IOException: 
 com.esotericsoftware.kryo.KryoException: Error constructing instance of 
 class: $line3.$read
 Serialization trace:
 $VAL10 ($iwC)
 $outer ($iwC$$iwC)
 $outer ($iwC$$iwC$Example)
   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1140)
   at 
 org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:979)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1873)
   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1970)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1895)
   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 {quote}
 As far as I can tell, when using :paste, Kryo serialization doesn't work for 
 classes defined within the same paste. It does work when the statements are 
 entered without paste.
 This issue seems serious to me, since Kryo serialization is virtually 
 mandatory for performance (20x slower with default serialization on my 
 problem), and I'm assuming feature parity between spark-shell and 
 spark-submit is a goal.
 Note that this is different from SPARK-6497, which covers the case when Kryo 
 is set to require registration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5711) Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5711.
--
Resolution: Not A Problem

I'm not sure this qualifies as a bug. You're just saying that processing a lot 
of data took a long time, and that time was spent somewhere. If you have a 
specific suggestion about how to set the size of this map more intelligently to 
avoid growing/rehashing, we can reopen.
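
For reference, a hedged sketch of what such a change could amount to; 
AppendOnlyMap is a private[spark] class, so this only compiles from inside 
Spark itself, and the capacity value is arbitrary:

{code}
import org.apache.spark.util.collection.AppendOnlyMap

// AppendOnlyMap already takes an initialCapacity argument (default 64); the
// suggestion above amounts to threading a larger, workload-appropriate value
// through from the shuffle code instead of always starting at the default.
val map = new AppendOnlyMap[String, Long](initialCapacity = 1 << 20)
{code}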

 Sort Shuffle performance issues about using AppendOnlyMap for large data sets
 -

 Key: SPARK-5711
 URL: https://issues.apache.org/jira/browse/SPARK-5711
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
 Environment: hbase-0.98.6-cdh5.2.0 phoenix-4.2.2
Reporter: Sun Fulin

 Recently we hit performance issues when using Spark 1.2.0 to read data from 
 HBase and do some summary work.
 Our scenario is to read large data sets from HBase (maybe a 100 GB+ file), 
 form an HBase RDD, transform it to a SchemaRDD, then group by and aggregate 
 the data into a few smaller summary data sets and load them into HBase 
 (Phoenix).
 Our major issue is that aggregating the large data sets into summary data sets 
 takes far too long (1 hour+), which is much worse performance than should be 
 expected. We attached the dump file; the jstack stacktrace looks like the 
 following.
 From the stacktrace and dump file we can see that processing large data sets 
 causes the AppendOnlyMap to grow frequently, leading to a huge map entry size. 
 We looked at the source code of 
 org.apache.spark.util.collection.AppendOnlyMap and found that the map is 
 initialized with a capacity of 64. That is too small for our use case.
 Thread 22432: (state = IN_JAVA)
 - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, 
 line=224 (Compiled frame; information may be imprecise)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() 
 @bci=1, line=38 (Interpreted frame)
 - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, 
 line=198 (Compiled frame)
 - 
 org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, 
 scala.Function2) @bci=201, line=145 (Compiled frame)
 - 
 org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
  scala.Function2) @bci=3, line=32 (Compiled frame)
 - 
 org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
  @bci=141, line=205 (Compiled frame)
 - 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
  @bci=74, line=58 (Interpreted frame)
 - 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
  @bci=169, line=68 (Interpreted frame)
 - 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
  @bci=2, line=41 (Interpreted frame)
 - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted 
 frame)
 - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 
 (Interpreted frame)
 - 
 java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
  @bci=95, line=1145 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)
 Thread 22431: (state = IN_JAVA)
 - org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, 
 line=224 (Compiled frame; information may be imprecise)
 - org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() 
 @bci=1, line=38 (Interpreted frame)
 - org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, 
 line=198 (Compiled frame)
 - 
 org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, 
 scala.Function2) @bci=201, line=145 (Compiled frame)
 - 
 org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object,
  scala.Function2) @bci=3, line=32 (Compiled frame)
 - 
 org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator)
  @bci=141, line=205 (Compiled frame)
 - 
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator)
  @bci=74, line=58 (Interpreted frame)
 - 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
  @bci=169, line=68 (Interpreted frame)
 - 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext)
  @bci=2, line=41 (Interpreted frame)
 - org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted 
 frame)
 - org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 
 (Interpreted frame)
 - 
 

[jira] [Commented] (SPARK-5412) Cannot bind Master to a specific hostname as per the documentation

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545371#comment-14545371
 ] 

Sean Owen commented on SPARK-5412:
--

A-ha. I think the issue is that additional args to {{start-master.sh}} aren't 
passed through to {{Master}} with $@. I think they are intended to be, as the 
same thing is done in {{start-slave.sh}} for example. Let me look a little more 
and open a PR if it seems like the right thing to do.

 Cannot bind Master to a specific hostname as per the documentation
 --

 Key: SPARK-5412
 URL: https://issues.apache.org/jira/browse/SPARK-5412
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.2.0
Reporter: Alexis Seigneurin

 Documentation on http://spark.apache.org/docs/latest/spark-standalone.html 
 indicates:
 {quote}
 You can start a standalone master server by executing:
 ./sbin/start-master.sh
 ...
 the following configuration options can be passed to the master and worker:
 ...
 -h HOST, --host HOST  Hostname to listen on
 {quote}
 The \-h or --host parameter actually doesn't work with the 
 start-master.sh script. Instead, one has to set the SPARK_MASTER_IP 
 variable prior to executing the script.
 Either the script or the documentation should be updated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.

2015-05-15 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-7664:
--
Summary: DAG visualization: Fix incorrect link paths of DAG.  (was: Fix 
incorrect link paths of DAG.)

 DAG visualization: Fix incorrect link paths of DAG.
 ---

 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 In JobPage, we can jump a StagePage when we click corresponding box of DAG 
 viz but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7631) treenode argString should not print children

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7631:
-
Priority: Minor  (was: Major)

 treenode argString should not print children
 

 Key: SPARK-7631
 URL: https://issues.apache.org/jira/browse/SPARK-7631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang
Priority: Minor

 spark-sql explain extended
   select * from (
     select key from src union all
     select key from src) t;
 the spark plan will print children in argString

 == Physical Plan ==
 Union[ HiveTableScan [key#1], (MetastoreRelation default, src, None), None,
  HiveTableScan [key#3], (MetastoreRelation default, src, None), None]
  HiveTableScan [key#1], (MetastoreRelation default, src, None), None
  HiveTableScan [key#3], (MetastoreRelation default, src, None), None



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4

2015-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-7536:
---
Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 


 Audit MLlib Python API for 1.4
 --

 Key: SPARK-7536
 URL: https://issues.apache.org/jira/browse/SPARK-7536
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions. We need to track:
 * Inconsistency: Do class/method/parameter names match?
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc. SPARK-7666
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release. SPARK-7665
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python.
 ** classification
 *** StreamingLogisticRegressionWithSGD SPARK-7633
 ** clustering
 *** GaussianMixture SPARK-6258
 *** LDA SPARK-6259
 *** Power Iteration Clustering SPARK-5962
 *** StreamingKMeans SPARK-4118 
 ** evaluation
 *** MultilabelMetrics SPARK-6094 
 ** feature
 *** ElementwiseProduct SPARK-7605
 *** PCA SPARK-7604
 ** linalg
 *** Distributed linear algebra SPARK-6100
 ** pmml.export SPARK-7638
 ** regression
 *** StreamingLinearRegressionWithSGD SPARK-4127
 ** stat
 *** KernelDensity SPARK-7639
 ** util
 *** MLUtils SPARK-6263 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7666) MLlib Python doc parity check

2015-05-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-7666:
--

 Summary: MLlib Python doc parity check
 Key: SPARK-7666
 URL: https://issues.apache.org/jira/browse/SPARK-7666
 Project: Spark
  Issue Type: Documentation
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang


Check then make the MLlib Python doc to be as complete as the Scala doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6399:
-
Priority: Minor  (was: Major)

 Code compiled against 1.3.0 may not run against older Spark versions
 

 Key: SPARK-6399
 URL: https://issues.apache.org/jira/browse/SPARK-6399
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin
Priority: Minor

 Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
 easier to use. The problem is that scalac now generates code that will not 
 run on older Spark versions if those conversions are used.
 Basically, even if you explicitly import {{SparkContext._}}, scalac will 
 generate references to the new methods in the {{RDD}} object instead. So the 
 compiled code will reference code that doesn't exist in older versions of 
 Spark.
 You can work around this by explicitly calling the methods in the 
 {{SparkContext}} object, although that's a little ugly.
 We should at least document this limitation (if there's no way to fix it), 
 since I believe forwards compatibility in the API was also a goal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6355:
-
Priority: Minor  (was: Major)

 Spark standalone cluster does not support local:/ url for jar file
 --

 Key: SPARK-6355
 URL: https://issues.apache.org/jira/browse/SPARK-6355
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1, 1.3.0
Reporter: Jesper Lundgren
Priority: Minor

 Submitting a new spark application to a standalone cluster with local:/path 
 will result in an exception.
 Driver successfully submitted as driver-20150316171157-0004
 ... waiting before polling master for driver state
 ... polling master for driver state
 State of driver-20150316171157-0004 is ERROR
 Exception from cluster was: java.io.IOException: No FileSystem for scheme: 
 local
 java.io.IOException: No FileSystem for scheme: local
   at 
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141)
   at 
 org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545356#comment-14545356
 ] 

Sean Owen commented on SPARK-6415:
--

Sort of related to https://issues.apache.org/jira/browse/SPARK-4545

You don't really want the whole streaming system to stop if one batch fails 
though, right? I can see wanting to stop it if every batch will fail, though 
that's harder to know.

 Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills 
 the app
 -

 Key: SPARK-6415
 URL: https://issues.apache.org/jira/browse/SPARK-6415
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Hari Shreedharan

 Of course, this would have to be done as a configurable param, but such a 
 fail-fast is useful else it is painful to figure out what is happening when 
 there are cascading failures. In some cases, the SparkContext shuts down and 
 streaming keeps scheduling jobs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6415) Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills the app

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6415:
-
Issue Type: Improvement  (was: Bug)

 Spark Streaming fail-fast: Stop scheduling jobs when a batch fails, and kills 
 the app
 -

 Key: SPARK-6415
 URL: https://issues.apache.org/jira/browse/SPARK-6415
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Hari Shreedharan

 Of course, this would have to be done as a configurable param, but such a 
 fail-fast is useful else it is painful to figure out what is happening when 
 there are cascading failures. In some cases, the SparkContext shuts down and 
 streaming keeps scheduling jobs 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6035) Unable to launch spark stream driver in cluster mode

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6035.
--
Resolution: Not A Problem

This looks like a problem specific to your setup on EC2. Something failed to 
start up.

 Unable to launch spark stream driver in cluster mode
 

 Key: SPARK-6035
 URL: https://issues.apache.org/jira/browse/SPARK-6035
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.1
Reporter: pankaj

 Spark version: spark-1.2.1-bin-hadoop2.4
 Getting an error while launching the driver from the master node in cluster 
 mode; the driver gets launched on one of the slave nodes in the cluster.
 Issue: when launching from the master, the driver is launched on one of the 
 slaves and tries to connect to the master on port 0. In this case I am 
 launching it from the master, but even if I launch it from some other slave 
 node, it tries to connect to the node from which I launched it.
 Exception
 2015-02-26 07:36:05 INFO  SecurityManager:59 - Changing view acls to: root
 2015-02-26 07:36:05 INFO  SecurityManager:59 - Changing modify acls to: root
 2015-02-26 07:36:05 INFO  SecurityManager:59 - SecurityManager: 
 authentication disabled; ui acls disabled; users with view permissions: 
 Set(root); users with modify permissions: Set(root)
 2015-02-26 07:36:05 DEBUG AkkaUtils:63 - In createActorSystem, requireCookie 
 is: off
 2015-02-26 07:36:06 INFO  Slf4jLogger:80 - Slf4jLogger started
 2015-02-26 07:36:06 ERROR NettyTransport:65 - failed to bind to ec-node 
 -where i run submit command-.compute-1.amazonaws.com/xx.xx.xx.xx:0, shutting 
 down Netty transport
 2015-02-26 07:36:06 ERROR Remoting:65 - Remoting error: [Startup failed] [
 akka.remote.RemoteTransportException: Startup failed
 at 
 akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:136)
 at akka.remote.Remoting.start(Remoting.scala:201)
 at 
 akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
 at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
 at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
 at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
 at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
 at 
 org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
 at 
 org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
 Thanks
 Pankaj



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6533) Allow using wildcard and other file pattern in Parquet DataSource

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6533:
-
Labels:   (was: backport-needed)

 Allow using wildcard and other file pattern in Parquet DataSource
 -

 Key: SPARK-6533
 URL: https://issues.apache.org/jira/browse/SPARK-6533
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Jianshi Huang
Priority: Critical

 By default, spark.sql.parquet.useDataSourceApi is set to true, and loading 
 Parquet files using a file pattern throws errors.
 *\*Wildcard*
 {noformat}
 scala> val qp = 
 sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0*")
 15/03/25 08:43:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 15/03/25 08:43:59 WARN hdfs.BlockReaderLocal: The short-circuit local reads 
 feature cannot be used because libhadoop cannot be loaded.
 java.io.FileNotFoundException: File does not exist: 
 hdfs://.../source=live/date=2014-06-0*
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
   at 
 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:276)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
 {noformat}
 And
 *\[abc\]*
 {noformat}
 val qp = sqlContext.parquetFile("hdfs://.../source=live/date=2014-06-0[12]")
 java.lang.IllegalArgumentException: Illegal character in path at index 74: 
 hdfs://.../source=live/date=2014-06-0[12]
   at java.net.URI.create(URI.java:859)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:268)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:267)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:267)
   at 
 org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:388)
   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:522)
   ... 49 elided
 Caused by: java.net.URISyntaxException: Illegal character in path at index 
 74: hdfs://.../source=live/date=2014-06-0[12]
   at java.net.URI$Parser.fail(URI.java:2829)
   at java.net.URI$Parser.checkChars(URI.java:3002)
   at java.net.URI$Parser.parseHierarchical(URI.java:3086)
   at java.net.URI$Parser.parse(URI.java:3034)
   at java.net.URI.init(URI.java:595)
   at java.net.URI.create(URI.java:857)
 {noformat}
 If spark.sql.parquet.useDataSourceApi is not enabled we cannot have partition 
 discovery, schema evolution, etc., but being able to specify a file pattern is 
 also very important to applications.
 Please add this important feature.
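 A possible interim workaround is sketched below. It assumes a spark-shell-style session 
 (sc and sqlContext in scope), Spark 1.3's {{SQLContext.parquetFile(paths: String*)}} and the 
 Hadoop FileSystem glob API; the path is the same placeholder as in the report above.
 {code}
 // Sketch only: expand the glob with the Hadoop FileSystem API, then pass the
 // concrete paths to parquetFile.
 import org.apache.hadoop.fs.{FileSystem, Path}

 val fs = FileSystem.get(sc.hadoopConfiguration)
 val matched = fs.globStatus(new Path("hdfs://.../source=live/date=2014-06-0*"))
   .map(_.getPath.toString)

 val qp = sqlContext.parquetFile(matched: _*)
 {code}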
 Jianshi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7476) Dynamic partitioning random behaviour

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7476.
--
Resolution: Invalid

I think this is at best a question for user@. I don't think this relates to 
dynamic partition discovery if that's what you mean, nor is it random.

 Dynamic partitioning random behaviour
 -

 Key: SPARK-7476
 URL: https://issues.apache.org/jira/browse/SPARK-7476
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
 Environment: Spark SQL in standalone mode on CDH 5.3
Reporter: Eswara Reddy Adapa

 According to the documentation, Spark SQL 1.2 supports dynamic partitioning.
 But we see the following:
 Expected output - 
 !http://ibin.co/20zI4242Ur1h!
 Output seen - 
 !http://ibin.co/20zILW5LQ5nT!
 It generates only one partition (on a random value each time).
 Query:
 USE miah_ga;
 SET hive.exec.dynamic.partition=true;
 SET hive.exec.dynamic.partition.mode = nonstrict;
 DROP TABLE sem_stg_tmp_part;
 CREATE TABLE sem_stg_tmp_part
 (chnl_nm string
 ,cmpgn_yr_nbr string
 ,cmpgn_qtr_nbr string
 ,actv_mo_nm string
 ,actv_wk_end_dt string
 ,actv_dt string
 ,seg_nm string
 ,bdgt_node_nm string
 ,ad_grp_nm string
 ,kywrd_txt string
 ,srch_engn_nm string
 ,engn_cmpgn_nm string
 ,publ_ctry_nm string
 ,publ_geo_cd string
 ,last_dest_url_txt string
 ,dvc_cat_nm string
 ,mdia_propty_nm string
 ,mdia_type_nm string
 ,audnc_nm string
 ,sub_aud_nm string
 ,prog_nm string
 ,org_intv_nm string
 ,sub_org_intv_nm string
 ,prd_nm string
 ,prim_engg_dest_nm string
 ,imprsn_cnt string
 ,click_cnt string
 ,tot_cost_amt string
 ,ctr_pct string
 ,cpc_amt string
 ,vist_cnt string
 ,paid_per_vist_cnt string
 ,paid_pir_vist_cnt string
 ,cost_per_intel_per_vist_amt string)
 partitioned by (ddate string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE;
 INSERT overwrite TABLE sem_stg_tmp_part PARTITION (ddate)
 SELECT *, concat(substr(actv_dt,1,2),substr(actv_dt,4,2),substr(actv_dt,7,4)) 
 as ddate
 FROM miah_ga.sem_stg_tmp;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7665) MLlib Python API breaking changes check between 1.3 & 1.4

2015-05-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-7665:
--

 Summary: MLlib Python API breaking changes check between 1.3 & 1.4
 Key: SPARK-7665
 URL: https://issues.apache.org/jira/browse/SPARK-7665
 Project: Spark
  Issue Type: Documentation
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang


Compare the MLlib Python APIs between 1.3 and 1.4, so we can note breaking changes.
We'll need to note those changes (if any) in the user guide's Migration Guide 
section.
If an API change is for an Alpha/Experimental/DeveloperApi component, we should 
note that as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7063) Update lz4 for Java 7 to avoid: when lz4 compression is used, it causes core dump

2015-05-15 Thread Tim Ellison (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545368#comment-14545368
 ] 

Tim Ellison commented on SPARK-7063:


I can confirm that this failure is no longer seen using LZ4 1.3.0 with IBM Java 
7+.

 Update lz4 for Java 7 to avoid: when lz4 compression is used, it causes core 
 dump
 -

 Key: SPARK-7063
 URL: https://issues.apache.org/jira/browse/SPARK-7063
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
 Environment: IBM JDK
Reporter: Jenny MA
Priority: Minor

 This issue was initially noticed when using the IBM JDK. Below is the stack 
 trace, caused by violating the JNI critical-section rules.
 #0 0x00314340f3cb in raise () from 
 /service/pmrs/45638/20/lib64/libpthread.so.0
 #1 0x7f795b0323be in j9dump_create () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9prt27.so
 #2 0x7f795a88ba2a in doSystemDump () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so
 #3 0x7f795b0405d5 in j9sig_protect () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9prt27.so
 #4 0x7f795a88a1fd in runDumpFunction () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so
 #5 0x7f795a88dbab in runDumpAgent () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so
 #6 0x7f795a8a1c49 in triggerDumpAgents () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9dmp27.so
 #7 0x7f795a4518fe in doTracePoint () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9trc27.so
 #8 0x7f795a45210e in j9Trace () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9trc27.so
 #9 0x7f79590e46e1 in 
 MM_StandardAccessBarrier::jniReleasePrimitiveArrayCritical(J9VMThread*, 
 _jarray*, void*, int) ()
 from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9gc27.so
 #10 0x7f7938bc397c in 
 Java_net_jpountz_lz4_LZ4JNI_LZ4_1compress_1limitedOutput () from 
 /service/pmrs/45638/20/tmp/liblz4-java7155003924599399415.so
 #11 0x7f795b707149 in VMprJavaSendNative () from 
 /service/pmrs/45638/20/opt/ibm/biginsights/jdk/jre/lib/amd64/compressedrefs/libj9vm27.so
 #12 0x in ?? ()
 This is an issue introduced by a bug in net.jpountz.lz4:lz4 1.2.0 and fixed in 
 version 1.3.0. The Sun JDK / OpenJDK does not complain about it, but it triggers 
 an assertion failure when the IBM JDK is used. Here is the link to the fix: 
 https://github.com/jpountz/lz4-java/commit/07229aa2f788229ab4f50379308297f428e3d2d2
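 For applications that hit this crash before their Spark distribution picks up the fix, 
 one way to force the fixed artifact onto the classpath is sketched below for an sbt build 
 (the coordinates are the ones named above; adjust for Maven as needed):
 {code}
 // build.sbt sketch: override the transitive lz4 dependency with the fixed 1.3.0 release.
 dependencyOverrides += "net.jpountz.lz4" % "lz4" % "1.3.0"
 {code}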
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.

2015-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545305#comment-14545305
 ] 

Apache Spark commented on SPARK-7664:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/6184

 DAG visualization: Fix incorrect link paths of DAG.
 ---

 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 In JobPage, we can jump to a StagePage by clicking the corresponding box of the 
 DAG viz, but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7664:
---

Assignee: (was: Apache Spark)

 DAG visualization: Fix incorrect link paths of DAG.
 ---

 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor

 In JobPage, we can jump to a StagePage by clicking the corresponding box of the 
 DAG viz, but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7664) DAG visualization: Fix incorrect link paths of DAG.

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7664:
---

Assignee: Apache Spark

 DAG visualization: Fix incorrect link paths of DAG.
 ---

 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Assignee: Apache Spark
Priority: Minor

 In JobPage, we can jump to a StagePage by clicking the corresponding box of the 
 DAG viz, but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6973) The total stages on the allJobsPage is wrong

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6973:
-
Component/s: (was: Spark Core)
 Web UI

 The total stages on the allJobsPage is wrong
 

 Key: SPARK-6973
 URL: https://issues.apache.org/jira/browse/SPARK-6973
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: meiyoula
Priority: Minor
 Attachments: allJobs.png


 The job has two stages, a map stage and a collect stage. Both were retried twice.
 The first and second attempts of the map stage were successful, and the third was 
 skipped. For the collect stage, the first and second attempts failed, and the third 
 was successful.
 On the allJobs page, the number of total stages is allStages - skippedStages. 
 Mostly that's right, but here I think the total should be 2.
 The example:
 Stage 0: Map Stage
 Stage 1: Collect Stage
 Stage:  Stage 0 - Stage 1 - Stage 0 (retry 1) - Stage 1 (retry 1) - Stage 0 (retry 2) - Stage 1 (retry 2)
 Status: Success - Fail    - Success           - Fail              - Skipped           - Success
 Though one attempt of Stage 0 is skipped, it was actually executed, so I think it 
 should be included in the total number.
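 A paraphrase of the reporter's expectation in code (illustrative only; the names below 
 are hypothetical and this is not the actual AllJobsPage implementation):
 {code}
 // Illustrative sketch, not Spark's UI code: count every distinct stage of the job,
 // even when one of its attempts was marked as skipped.
 case class StageAttempt(stageId: Int, status: String)

 val attempts = Seq(
   StageAttempt(0, "Success"), StageAttempt(1, "Fail"),
   StageAttempt(0, "Success"), StageAttempt(1, "Fail"),
   StageAttempt(0, "Skipped"), StageAttempt(1, "Success"))

 val totalStages = attempts.map(_.stageId).distinct.size  // 2, as the reporter expects
 {code}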



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7603) Crash of thrift server when doing SQL without limit

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7603:
-
Component/s: (was: Spark Core)
 SQL

 Crash of thrift server when doing SQL without limit
 -

 Key: SPARK-7603
 URL: https://issues.apache.org/jira/browse/SPARK-7603
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
 Environment: Hortonworks Sandbox 2.1  with Spark 1.3.1
Reporter: Ihor Bobak

 I have 2 tables in Hive: one with 120 thousand records, the other one 5 times 
 smaller.
 I'm running a standalone cluster on a single VM, and the thrift server is started 
 with the
 ./start-thriftserver.sh --conf spark.executor.memory=2048m --conf 
 spark.driver.memory=1024m
 command.
 My spark-defaults.conf contains:
 spark.master             spark://sandbox.hortonworks.com:7077
 spark.eventLog.enabled   true
 spark.eventLog.dir       hdfs://sandbox.hortonworks.com:8020/user/pdi/spark/logs
 So, when I run the SQL
 select some fields from header, some fields from details
 from  
   vw_salesorderdetail as d 
   left join vw_salesorderheader as h on h.SalesOrderID = d.SalesOrderID 
 limit 20;
 everything is fine, even though the limit is not really needed (the result set 
 returned is just 12 records).
 But if I run the same query without the limit clause, execution hangs - see here: 
 http://postimg.org/image/fujdjd16f/42945a78/
 - and the thrift server logs contain a lot of exceptions:
 15/05/13 17:59:27 INFO TaskSetManager: Starting task 158.0 in stage 48.0 (TID 
 953, sandbox.hortonworks.com, PROCESS_LOCAL, 1473 bytes)
 15/05/13 18:00:01 INFO TaskSetManager: Finished task 150.0 in stage 48.0 (TID 
 945) in 36166 ms on sandbox.hortonworks.com (152/200)
 15/05/13 18:00:02 ERROR Utils: Uncaught exception in thread Spark Context 
 Cleaner
 java.lang.OutOfMemoryError: GC overhead limit exceeded
   at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:147)
   at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
   at 
 org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:144)
   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
   at 
 org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:143)
   at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
 Exception in thread Spark Context Cleaner 15/05/13 18:00:02 ERROR Utils: 
 Uncaught exception in thread task-result-getter-1
 java.lang.OutOfMemoryError: GC overhead limit exceeded
   at java.lang.String.init(String.java:315)
   at com.esotericsoftware.kryo.io.Input.readAscii(Input.java:562)
   at com.esotericsoftware.kryo.io.Input.readString(Input.java:436)
   at 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
   at 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
   at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:706)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
   at 

[jira] [Updated] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7042:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

I think this is really an Akka / Scala problem, but we can keep this open to track 
updating Akka at some point.

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Priority: Minor

 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like the akka-actor_2.11 2.3.4-spark artifact used by Spark was built 
 with Scala compiler 2.11.0, which ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom akka library used by Spark with a more recent version of 
 the Scala compiler (e.g. 2.11.6)
 - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
 - update the version of akka used by Spark (master and the 1.3 branch)
 I would also suggest upgrading to the latest version of akka, 2.3.9 (or 
 2.3.10, which should be released soon).
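 A quick way to confirm which serialVersionUID a given akka build carries is sketched 
 below (run in a Scala REPL or spark-shell with the akka jar in question on the classpath; 
 the values in the comment are the ones reported in the error above):
 {code}
 // Sketch: print the serialVersionUID of akka.actor.Identify for the akka jar on the classpath.
 import java.io.ObjectStreamClass

 val uid = ObjectStreamClass.lookup(classOf[akka.actor.Identify]).getSerialVersionUID
 println(uid)  // -213377755528332889 for the custom 2.3.4-spark build, 1 for the standard akka 2.3.x build
 {code}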



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6973) The total stages on the allJobsPage is wrong

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6973:
-
Priority: Minor  (was: Major)

 The total stages on the allJobsPage is wrong
 

 Key: SPARK-6973
 URL: https://issues.apache.org/jira/browse/SPARK-6973
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: meiyoula
Priority: Minor
 Attachments: allJobs.png


 The job has two stages, a map stage and a collect stage. Both were retried twice.
 The first and second attempts of the map stage were successful, and the third was 
 skipped. For the collect stage, the first and second attempts failed, and the third 
 was successful.
 On the allJobs page, the number of total stages is allStages - skippedStages. 
 Mostly that's right, but here I think the total should be 2.
 The example:
 Stage 0: Map Stage
 Stage 1: Collect Stage
 Stage:  Stage 0 - Stage 1 - Stage 0 (retry 1) - Stage 1 (retry 1) - Stage 0 (retry 2) - Stage 1 (retry 2)
 Status: Success - Fail    - Success           - Fail              - Skipped           - Success
 Though one attempt of Stage 0 is skipped, it was actually executed, so I think it 
 should be included in the total number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545362#comment-14545362
 ] 

Sean Owen commented on SPARK-6056:
--

I can't make out whether this is an issue or not. Do you just need to allow for 
more off-heap memory in YARN?

 Unlimit offHeap memory use cause RM killing the container
 -

 Key: SPARK-6056
 URL: https://issues.apache.org/jira/browse/SPARK-6056
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.1
Reporter: SaintBacchus

 No matter whether we set `preferDirectBufs` or limit the number of threads, 
 Spark cannot limit the use of off-heap memory.
 At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final, 
 Netty allocates an off-heap buffer of the same size as the heap buffer.
 So however many buffers you want to transfer, the same amount of off-heap memory 
 will be allocated.
 But once the allocated memory reaches the overhead memory capacity set in YARN, 
 the executor will be killed.
 I wrote a simple piece of code to test it:
 {code:title=test.scala|borderStyle=solid}
 import org.apache.spark.storage._
 import org.apache.spark._

 val bufferRdd = sc.makeRDD(0 to 10, 10).map(x => new Array[Byte](10*1024*1024)).persist
 bufferRdd.count
 val part = bufferRdd.partitions(0)
 val sparkEnv = SparkEnv.get
 val blockMgr = sparkEnv.blockManager

 def test = {
   val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index))
   val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]]
   val len = resultIt.map(_.length).sum
   println(s"[${Thread.currentThread.getId}] get block length = $len")
 }

 def test_driver(count: Int, parallel: Int)(f: => Unit) = {
   val tpool = new scala.concurrent.forkjoin.ForkJoinPool(parallel)
   val taskSupport = new scala.collection.parallel.ForkJoinTaskSupport(tpool)
   val parseq = (1 to count).par
   parseq.tasksupport = taskSupport
   parseq.foreach(x => f)
   tpool.shutdown
   tpool.awaitTermination(100, java.util.concurrent.TimeUnit.SECONDS)
 }
 {code}
 Steps to reproduce:
 1. bin/spark-shell --master yarn-client --executor-cores 40 --num-executors 1
 2. :load test.scala in spark-shell
 3. use the following command to watch the executor process on the slave node
 {code}
 pid=$(jps|grep CoarseGrainedExecutorBackend |awk '{print $1}');top -b -p $pid|grep $pid
 {code}
 4. run test_driver(20,100)(test) in spark-shell
 5. watch the output of the command on the slave node
 If multiple threads are used to get len, the physical memory soon exceeds the 
 limit set by spark.yarn.executor.memoryOverhead.
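 Until the underlying allocation pattern is bounded, one mitigation (a sketch only, not a 
 fix; the value is illustrative) is to give the executors more headroom via the 
 configuration key the reporter already names:
 {code}
 // Sketch (value is illustrative): request more YARN overhead room so Netty's direct
 // buffers have headroom. Equivalent to passing
 //   --conf spark.yarn.executor.memoryOverhead=2048
 // to spark-shell / spark-submit. This raises the ceiling; it does not bound the allocation.
 import org.apache.spark.SparkConf

 val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "2048")
 {code}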



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7503) Resources in .sparkStaging directory can't be cleaned up on error

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7503:
-
Assignee: Kousuke Saruta

 Resources in .sparkStaging directory can't be cleaned up on error
 -

 Key: SPARK-7503
 URL: https://issues.apache.org/jira/browse/SPARK-7503
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
 Fix For: 1.4.0


 When we run applications on YARN in cluster mode, uploaded resources in the 
 .sparkStaging directory can't be cleaned up if uploading the local resources 
 fails.
 You can see this issue by running the following command.
 {code}
 bin/spark-submit --master yarn --deploy-mode cluster --class someClassName 
 non-existing-jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7503) Resources in .sparkStaging directory can't be cleaned up on error

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7503.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6026
[https://github.com/apache/spark/pull/6026]

 Resources in .sparkStaging directory can't be cleaned up on error
 -

 Key: SPARK-7503
 URL: https://issues.apache.org/jira/browse/SPARK-7503
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Fix For: 1.4.0


 When we run applications on YARN in cluster mode, uploaded resources in the 
 .sparkStaging directory can't be cleaned up if uploading the local resources 
 fails.
 You can see this issue by running the following command.
 {code}
 bin/spark-submit --master yarn --deploy-mode cluster --class someClassName 
 non-existing-jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7664) Fix incorrect link paths of DAG.

2015-05-15 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-7664:
-

 Summary: Fix incorrect link paths of DAG.
 Key: SPARK-7664
 URL: https://issues.apache.org/jira/browse/SPARK-7664
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
Priority: Minor


In JobPage, we can jump to a StagePage by clicking the corresponding box of the DAG viz, 
but the link path is incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7344) Spark hangs reading and writing to the same S3 bucket

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545323#comment-14545323
 ] 

Sean Owen commented on SPARK-7344:
--

Yes, but the most recent script still runs with Hadoop 1.x code I think, and 
that has old S3 libs. I think this is an S3 client library problem; at the 
least, I would try a build with Hadoop 2.x bindings and later jets3t libs first.

 Spark hangs reading and writing to the same S3 bucket
 -

 Key: SPARK-7344
 URL: https://issues.apache.org/jira/browse/SPARK-7344
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
 Environment: AWS EC2
Reporter: Daniel Mahler

 The following code will hang if the `outprefix` is in an S3 bucket:
 def copy1 = "s3n://mybucket/copy1"
 def copy2 = "s3n://mybucket/copy2"
 val txt1 = sc.textFile(inpath)
 txt1.count
 val res = txt1.saveAsTextFile(copy1)
 val txt2 = sc.textFile(copy1 + "/part-*")
 txt2.count
 txt2.saveAsTextFile(copy2) // - HANGS HERE
 val txt3 = sc.textFile(copy2 + "/part-*")
 txt3.count
 The problem goes away if copy1 and copy2 are in distinct S3 buckets or when 
 using HDFS instead of S3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4

2015-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-7536:
---
Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match? SPARK-7667
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions. We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc. SPARK-7666
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 


 Audit MLlib Python API for 1.4
 --

 Key: SPARK-7536
 URL: https://issues.apache.org/jira/browse/SPARK-7536
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions. We need to track:
 * Inconsistency: Do class/method/parameter names match? SPARK-7667
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc. SPARK-7666
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release. SPARK-7665
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python.
 ** classification
 *** StreamingLogisticRegressionWithSGD SPARK-7633
 ** clustering
 *** GaussianMixture SPARK-6258
 *** LDA SPARK-6259
 *** Power Iteration Clustering SPARK-5962
 *** StreamingKMeans SPARK-4118 
 ** evaluation
 *** MultilabelMetrics SPARK-6094 
 ** feature
 *** ElementwiseProduct SPARK-7605
 *** PCA SPARK-7604
 ** linalg
 *** Distributed linear algebra SPARK-6100
 ** pmml.export SPARK-7638
 ** regression
 *** StreamingLinearRegressionWithSGD SPARK-4127
 ** stat
 *** KernelDensity SPARK-7639
 ** util
 *** MLUtils SPARK-6263 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6527) sc.binaryFiles can not access files on s3

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6527:
-
Component/s: EC2
   Priority: Minor  (was: Major)

Is there any more detail on this, like stack traces or the code you're running?

 sc.binaryFiles can not access files on s3
 -

 Key: SPARK-6527
 URL: https://issues.apache.org/jira/browse/SPARK-6527
 Project: Spark
  Issue Type: Bug
  Components: EC2, Input/Output
Affects Versions: 1.2.0, 1.3.0
 Environment: I am running Spark on EC2
Reporter: Zhao Zhang
Priority: Minor

 sc.binaryFiles() cannot access files stored on S3. It correctly lists the 
 number of files, but reports that the files do not exist when processing 
 them. I also tried sc.textFile(), which works fine.
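 A minimal repro sketch of the reported behaviour (bucket and prefix below are 
 placeholders, not from the report):
 {code}
 // Sketch only: listing the files succeeds, but reading their contents reportedly fails on S3.
 val binary = sc.binaryFiles("s3n://some-bucket/some-prefix/*")
 println(binary.count())                      // listing the files works
 binary.map(_._2.toArray().length).collect()  // reported to fail with "file does not exist"

 val text = sc.textFile("s3n://some-bucket/some-prefix/*")
 println(text.count())                        // reported to work fine
 {code}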



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-15 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545375#comment-14545375
 ] 

Konstantin Shaposhnikov commented on SPARK-7042:


There is nothing wrong with the standard Akka 2.11 build. In fact we have a 
custom build of Spark now that uses standard Akka 2.3.9 from maven central 
repository without any problems.

The error appears only with the custom build of akka (because it was compiled 
with a buggy version of Scala) that comes with Spark by default.

I agree that the number of users affected by this problem is probably quite small 
(only 1? ;)

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Priority: Minor

 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like the akka-actor_2.11 2.3.4-spark artifact used by Spark was built 
 with Scala compiler 2.11.0, which ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom akka library used by Spark with a more recent version of 
 the Scala compiler (e.g. 2.11.6)
 - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
 - update the version of akka used by Spark (master and the 1.3 branch)
 I would also suggest upgrading to the latest version of akka, 2.3.9 (or 
 2.3.10, which should be released soon).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5271) PySpark History Web UI issues

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5271.
--
Resolution: Not A Problem

 PySpark History Web UI issues
 -

 Key: SPARK-5271
 URL: https://issues.apache.org/jira/browse/SPARK-5271
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Web UI
Affects Versions: 1.2.0
 Environment: PySpark 1.2.0 in yarn-client mode
Reporter: Andrey Zimovnov

 After a successful run of a PySpark app via spark-submit in yarn-client mode on a 
 Hadoop 2.4 cluster, the History UI shows the same problem as in issue SPARK-3898.
 {code}
 App Name:Not Started Started:1970/01/01 07:59:59 Spark User:Not Started
 Last Updated:2014/10/10 14:50:39
 Exception message:
 2014-10-10 14:51:14,284 - ERROR - 
 org.apache.spark.Logging$class.logError(Logging.scala:96) - 
 qtp1594785497-16851 -Exception in parsing Spark event log 
 hdfs://wscluster/sparklogs/24.3g_15_5g_2c-1412923684977/EVENT_LOG_1
 org.json4s.package$MappingException: Did not find value which can be 
 converted into int
 at org.json4s.reflect.package$.fail(package.scala:96)
 at org.json4s.Extraction$.convert(Extraction.scala:554)
 at org.json4s.Extraction$.extract(Extraction.scala:331)
 at org.json4s.Extraction$.extract(Extraction.scala:42)
 at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
 at 
 org.apache.spark.util.JsonProtocol$.blockManagerIdFromJson(JsonProtocol.scala:647)
 at 
 org.apache.spark.util.JsonProtocol$.blockManagerAddedFromJson(JsonProtocol.scala:468)
 at 
 org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:404)
 at 
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
 at 
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
 at 
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
 at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
 at 
 org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
 at 
 org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$loadAppInfo(FsHistoryProvider.scala:181)
 at 
 org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:99)
 at 
 org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:55)
 at 
 org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:53)
 at 
 com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
 at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
 at 
 com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
 at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
 at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
 at 
 com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
 org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:88)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
 at 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
 at 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:370)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
 at 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at 
 

[jira] [Resolved] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5265.
--
Resolution: Duplicate

I think you described the same issue twice here; please close the old one if 
you're elaborating elsewhere.

 Submitting applications on Standalone cluster controlled by Zookeeper forces 
 to know active master
 --

 Key: SPARK-5265
 URL: https://issues.apache.org/jira/browse/SPARK-5265
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Roque Vassal'lo
  Labels: cluster, spark-submit, standalone, zookeeper

 Hi, this is my first JIRA here, so I hope it is clear enough.
 I'm using Spark 1.2.0 and trying to submit an application on a Spark 
 Standalone cluster in cluster deploy mode with supervise.
 The Standalone cluster is running in high-availability mode, using Zookeeper to 
 provide leader election among three available Masters (named master1, 
 master2 and master3).
 As described in Spark's documentation, to register a Worker with the Standalone 
 cluster I provide the complete cluster info as the spark URL,
 i.e. spark://master1:7077,master2:7077,master3:7077,
 and that URL is parsed into three connection attempts: the first to 
 master1:7077, the second to master2:7077 and the third to master3:7077.
 This works great!
 But if I try to do the same while submitting applications, it fails.
 I mean, if I provide the complete cluster info as the --master option to the 
 spark-submit script, it throws an exception because it tries to connect to it 
 as if it were a single node.
 Example:
 spark-submit --class org.apache.spark.examples.SparkPi --master 
 spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster 
 --supervise examples.jar 100
 This is the output I got:
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mytest); users 
 with modify permissions: Set(mytest)
 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started
 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on 
 port 53930.
 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
 akka.actor.ActorInitializationException: exception during creation
   at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
   at akka.actor.ActorCell.create(ActorCell.scala:596)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
   at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
   at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
   at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
   at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
   at akka.actor.ActorCell.create(ActorCell.scala:580)
   ... 9 more
 Shouldn't it be parsed the same way as on Worker registration?
 That would not force the client to know which Master of the Standalone cluster 
 is currently active.
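 An illustrative sketch of the parsing the reporter expects (not Spark's actual code; 
 the helper name is hypothetical):
 {code}
 // Sketch: split a comma-separated standalone URL into one spark:// URL per Master,
 // the way Worker registration already handles it.
 def parseMasterUrls(sparkUrl: String): Seq[String] = {
   require(sparkUrl.startsWith("spark://"), s"Invalid master URL: $sparkUrl")
   sparkUrl.stripPrefix("spark://").split(",").map("spark://" + _).toSeq
 }

 // parseMasterUrls("spark://master1:7077,master2:7077,master3:7077")
 // => Seq("spark://master1:7077", "spark://master2:7077", "spark://master3:7077")
 {code}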



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5241) spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5241.
--
Resolution: Invalid

I don't understand the problem being reported here. Reopen if you can suggest a 
particular change. Maybe start by asking on user@ ?

 spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) 
 dependencies correctly
 --

 Key: SPARK-5241
 URL: https://issues.apache.org/jira/browse/SPARK-5241
 Project: Spark
  Issue Type: Bug
  Components: Build, EC2
Reporter: Florian Verhein

 spark-ec2/spark/init.sh doesn't completely respect the Hadoop dependencies. 
 This may also be an issue for the Tachyon dependencies. Related: Tachyon appears 
 to require builds against the right version of Hadoop as well (probably causes 
 this: SPARK-3185).
 This applies to the Spark build from a git checkout in spark/init.sh (I suspect 
 this should also be changed to use mvn, as that's the reference build according 
 to the docs?).
 It may apply to pre-built Spark in spark/init.sh as well, but I'm not sure about 
 this. E.g. I thought that the hadoop2.4 and cdh4.2 builds of Spark are 
 different.
 Also note that hadoop native is built from Hadoop 2.4.1 on the AMI, and this 
 is used regardless of HADOOP_MAJOR_VERSION in the *-hdfs modules.
 Tachyon is hard-coded to 0.4.1 (which is probably built against hadoop1.x?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545464#comment-14545464
 ] 

Sean Owen commented on SPARK-4808:
--

I think this is considered resolved now for 1.4 after 
https://github.com/apache/spark/commit/3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099 
? but not 1.3.
Maybe [~andrewor14] can confirm.

 Spark fails to spill with small number of large objects
 ---

 Key: SPARK-4808
 URL: https://issues.apache.org/jira/browse/SPARK-4808
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler

 Spillable's maybeSpill does not allow a spill to occur until at least 1000 
 elements have been read, and then only evaluates whether to spill every 32nd 
 element thereafter. When a small number of very large items is being 
 tracked, out-of-memory conditions may occur.
 I suspect that this and the every-32nd-element behavior were meant to reduce the 
 impact of the estimateSize() call. That method was extracted into 
 SizeTracker, which implements its own exponential backoff for size estimation, 
 so now we are only avoiding using the resulting estimated size.
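 A paraphrase of the gating described above in code (illustrative only; this is not the 
 actual Spillable source and the names are hypothetical):
 {code}
 // Illustrative sketch of the described gating, not Spark's Spillable implementation.
 def shouldEvaluateSpill(elementsRead: Long): Boolean =
   elementsRead > 1000 && elementsRead % 32 == 0

 def maybeSpill(elementsRead: Long, currentMemory: Long, memoryThreshold: Long): Boolean =
   shouldEvaluateSpill(elementsRead) && currentMemory >= memoryThreshold

 // With only a handful of very large objects, elementsRead may never pass these gates,
 // so the size estimate is never acted on and the JVM can run out of memory first.
 {code}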



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4560) Lambda deserialization error

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4560.
--
Resolution: Not A Problem

 Lambda deserialization error
 

 Key: SPARK-4560
 URL: https://issues.apache.org/jira/browse/SPARK-4560
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.1.1
 Environment: Java 8.0.25
Reporter: Alexis Seigneurin
 Attachments: IndexTweets.java, pom.xml


 I'm getting an error saying a lambda could not be deserialized. Here is the 
 code:
 {code}
 TwitterUtils.createStream(sc, twitterAuth, filters)
     .map(t -> t.getText())
     .foreachRDD(tweets -> {
         tweets.foreach(x -> System.out.println(x));
         return null;
     });
 {code}
 Here is the exception:
 {noformat}
 java.io.IOException: unexpected exception type
   at 
 java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
   at 
 java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
   ... 27 more
 Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
   at 
 com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
   ... 37 more
 {noformat}
 The weird thing is, if I write the following code (with the map operation 
 inside the foreachRDD), it works without a problem.
 {code}
 TwitterUtils.createStream(sc, twitterAuth, filters)
     .foreachRDD(tweets -> {
         tweets.map(t -> t.getText())
               .foreach(x -> System.out.println(x));
         return null;
     });
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (SPARK-4556) binary distribution assembly can't run in local mode

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4556:
---

Assignee: Apache Spark

 binary distribution assembly can't run in local mode
 

 Key: SPARK-4556
 URL: https://issues.apache.org/jira/browse/SPARK-4556
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Reporter: Sean Busbey
Assignee: Apache Spark

 After building the binary distribution assembly, the resultant tarball can't 
 be used for local mode.
 {code}
 busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
 [INFO] Scanning for projects...
 ...SNIP...
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
 s]
 [INFO] Spark Project Networking ... SUCCESS [ 31.402 
 s]
 [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
 s]
 [INFO] Spark Project Core . SUCCESS [15:39 
 min]
 [INFO] Spark Project Bagel  SUCCESS [ 29.470 
 s]
 [INFO] Spark Project GraphX ... SUCCESS [05:20 
 min]
 [INFO] Spark Project Streaming  SUCCESS [11:02 
 min]
 [INFO] Spark Project Catalyst . SUCCESS [11:26 
 min]
 [INFO] Spark Project SQL .. SUCCESS [11:33 
 min]
 [INFO] Spark Project ML Library ... SUCCESS [14:27 
 min]
 [INFO] Spark Project Tools  SUCCESS [ 40.980 
 s]
 [INFO] Spark Project Hive . SUCCESS [11:45 
 min]
 [INFO] Spark Project REPL . SUCCESS [03:15 
 min]
 [INFO] Spark Project Assembly . SUCCESS [04:22 
 min]
 [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
 s]
 [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
 s]
 [INFO] Spark Project External Flume ... SUCCESS [01:41 
 min]
 [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
 s]
 [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
 s]
 [INFO] Spark Project External Kafka ... SUCCESS [01:23 
 min]
 [INFO] Spark Project Examples . SUCCESS [10:19 
 min]
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 01:47 h
 [INFO] Finished at: 2014-11-22T02:13:51-06:00
 [INFO] Final Memory: 79M/2759M
 [INFO] 
 
 busbey2-MBA:spark busbey$ cd assembly/target/
 busbey2-MBA:target busbey$ mkdir dist-temp
 busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
 spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
 busbey2-MBA:target busbey$ cd dist-temp/
 busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
 ls: 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
  No such file or directory
 Failed to find Spark assembly in 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
 You need to build Spark before running this program.
 {code}
 It looks like the classpath calculations in {{bin/compute-classpath.sh}} 
 don't handle it.
 If I move all of the spark-*.jar files from the top level into the lib folder 
 and touch the RELEASE file, then the spark shell launches in local mode 
 normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3602) Can't run cassandra_inputformat.py

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3602.
--
Resolution: Not A Problem

I think this is due to mismatching Hadoop libs, or at least is stale enough at 
this point that I think it should be closed.

 Can't run cassandra_inputformat.py
 --

 Key: SPARK-3602
 URL: https://issues.apache.org/jira/browse/SPARK-3602
 Project: Spark
  Issue Type: Bug
  Components: Examples, PySpark
Affects Versions: 1.1.0
 Environment: Ubuntu 14.04
Reporter: Frens Jan Rumph

 When I execute:
 {noformat}
 wget 
 http://apache.cs.uu.nl/dist/spark/spark-1.1.0/spark-1.1.0-bin-hadoop2.4.tgz
 tar xzf spark-1.1.0-bin-hadoop2.4.tgz
 cd spark-1.1.0-bin-hadoop2.4/
 ./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop2.4.0.jar 
 examples/src/main/python/cassandra_inputformat.py localhost keyspace cf
 {noformat}
 The output is:
 {noformat}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/09/19 10:41:10 WARN Utils: Your hostname, laptop-x resolves to a 
 loopback address: 127.0.0.1; using 192.168.2.2 instead (on interface wlan0)
 14/09/19 10:41:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/09/19 10:41:10 INFO SecurityManager: Changing view acls to: frens-jan,
 14/09/19 10:41:10 INFO SecurityManager: Changing modify acls to: frens-jan,
 14/09/19 10:41:10 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(frens-jan, ); 
 users with modify permissions: Set(frens-jan, )
 14/09/19 10:41:11 INFO Slf4jLogger: Slf4jLogger started
 14/09/19 10:41:11 INFO Remoting: Starting remoting
 14/09/19 10:41:11 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@laptop-x.local:43790]
 14/09/19 10:41:11 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkDriver@laptop-x.local:43790]
 14/09/19 10:41:11 INFO Utils: Successfully started service 'sparkDriver' on 
 port 43790.
 14/09/19 10:41:11 INFO SparkEnv: Registering MapOutputTracker
 14/09/19 10:41:11 INFO SparkEnv: Registering BlockManagerMaster
 14/09/19 10:41:11 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140919104111-145e
 14/09/19 10:41:11 INFO Utils: Successfully started service 'Connection 
 manager for block manager' on port 45408.
 14/09/19 10:41:11 INFO ConnectionManager: Bound socket to port 45408 with id 
 = ConnectionManagerId(laptop-x.local,45408)
 14/09/19 10:41:11 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
 14/09/19 10:41:11 INFO BlockManagerMaster: Trying to register BlockManager
 14/09/19 10:41:11 INFO BlockManagerMasterActor: Registering block manager 
 laptop-x.local:45408 with 265.4 MB RAM
 14/09/19 10:41:11 INFO BlockManagerMaster: Registered BlockManager
 14/09/19 10:41:11 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-5f0289d7-9b20-4bd7-a713-db84c38c4eac
 14/09/19 10:41:11 INFO HttpServer: Starting HTTP Server
 14/09/19 10:41:11 INFO Utils: Successfully started service 'HTTP file server' 
 on port 36556.
 14/09/19 10:41:11 INFO Utils: Successfully started service 'SparkUI' on port 
 4040.
 14/09/19 10:41:11 INFO SparkUI: Started SparkUI at 
 http://laptop-frens-jan.local:4040
 14/09/19 10:41:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/09/19 10:41:12 INFO SparkContext: Added JAR 
 file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/lib/spark-examples-1.1.0-hadoop2.4.0.jar
  at http://192.168.2.2:36556/jars/spark-examples-1.1.0-hadoop2.4.0.jar with 
 timestamp 146072417
 14/09/19 10:41:12 INFO Utils: Copying 
 /home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py
  to /tmp/spark-7dbb1b4d-016c-4f8b-858d-f79c9297f58f/cassandra_inputformat.py
 14/09/19 10:41:12 INFO SparkContext: Added file 
 file:/home/frens-jan/Desktop/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_inputformat.py
  at http://192.168.2.2:36556/files/cassandra_inputformat.py with timestamp 
 146072419
 14/09/19 10:41:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
 akka.tcp://sparkDriver@laptop-frens-jan.local:43790/user/HeartbeatReceiver
 14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with 
 curMem=0, maxMem=278302556
 14/09/19 10:41:12 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 163.7 KB, free 265.3 MB)
 14/09/19 10:41:12 INFO MemoryStore: ensureFreeSpace(167659) called with 
 curMem=167659, maxMem=278302556
 14/09/19 10:41:12 INFO MemoryStore: Block broadcast_1 stored as values in 
 memory 

[jira] [Commented] (SPARK-2445) MesosExecutorBackend crashes in fine grained mode

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545504#comment-14545504
 ] 

Sean Owen commented on SPARK-2445:
--

[~gbow...@fastmail.co.uk] are you saying that SPARK-3535 did actually resolve 
this?

 MesosExecutorBackend crashes in fine grained mode
 -

 Key: SPARK-2445
 URL: https://issues.apache.org/jira/browse/SPARK-2445
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Dario Rexin

 When multiple instances of the MesosExecutorBackend are running on the same 
 slave, they will have the same executorId assigned (equal to the mesos 
 slaveId), but will have a different port (which is randomly assigned). 
 Because of this, it can not register a new BlockManager, because one is 
 already registered with the same executorId, but a different BlockManagerId. 
 More description and a fix can be found in this PR on GitHub:
 https://github.com/apache/spark/pull/1358
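A minimal sketch of the collision described above, using simplified stand-in types rather than the real Spark/Mesos classes (the slave id, hostname and ports are made up for illustration):
{code}
// Stand-in types for illustration only; not the real Spark classes.
case class BlockManagerId(executorId: String, host: String, port: Int)

object BlockManagerRegistry {
  private val byExecutor = scala.collection.mutable.Map.empty[String, BlockManagerId]

  // Registration keyed by executorId: a second block manager with the same
  // executorId but a different port is rejected, mirroring the failure above.
  def register(id: BlockManagerId): Boolean = byExecutor.get(id.executorId) match {
    case Some(existing) if existing != id => false
    case _ => byExecutor(id.executorId) = id; true
  }
}

object FineGrainedCollisionDemo extends App {
  val slaveId = "mesos-slave-01" // both fine-grained executors reuse the Mesos slave id
  println(BlockManagerRegistry.register(BlockManagerId(slaveId, "host-a", 38000))) // true
  println(BlockManagerRegistry.register(BlockManagerId(slaveId, "host-a", 41237))) // false: collision
}
{code}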






[jira] [Resolved] (SPARK-1928) DAGScheduler suspended by local task OOM

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1928.
--
   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Peng Zhen

Resolved long ago by https://github.com/apache/spark/pull/883

 DAGScheduler suspended by local task OOM
 

 Key: SPARK-1928
 URL: https://issues.apache.org/jira/browse/SPARK-1928
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 0.9.0
Reporter: Peng Zhen
Assignee: Peng Zhen
 Fix For: 1.1.0


 DAGScheduler does not handle local task OOM properly, and will wait for the 
 job result forever.
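A hypothetical local-mode reproduction sketch of the hang described above (the allocation size, app name, and a small driver heap such as -Xmx64m are assumptions, not taken from the report):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object LocalOomRepro extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("oom-repro"))
  // Each task tries to allocate a huge array, so the locally-run task throws
  // java.lang.OutOfMemoryError; per the report, the DAGScheduler then waits forever.
  sc.parallelize(1 to 4).map(_ => new Array[Long](Int.MaxValue / 2)).count()
}
{code}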






[jira] [Resolved] (SPARK-2133) FileNotFoundException in BlockObjectWriter

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2133.
--
Resolution: Cannot Reproduce

 FileNotFoundException in BlockObjectWriter
 --

 Key: SPARK-2133
 URL: https://issues.apache.org/jira/browse/SPARK-2133
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: YARN
Reporter: Neville Li

 Seeing a lot of this when running ALS on large data sets (> 50GB) and YARN. 
 The job eventually fails after spark.task.maxFailures has been reached.
 {code}
 Exception in thread Thread-3 java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 1.0:677 failed 10 times, most recent failure: Exception failure in TID 
 946 on host lon4-hadoopslave-b501.lon4.spotify.net: 
 java.io.FileNotFoundException: 
 /disk/hd01/yarn/local/usercache/neville/appcache/application_1401944843353_36952/spark-local-20140611033053-6b18/2a/shuffle_0_677_985
  (No such file or directory)
 java.io.FileOutputStream.openAppend(Native Method)
 java.io.FileOutputStream.init(FileOutputStream.java:192)
 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
 
 org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
 
 org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 org.apache.spark.scheduler.Task.run(Task.scala:51)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 java.lang.Thread.run(Thread.java:662)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}






[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-15 Thread Guillaume E.B. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545533#comment-14545533
 ] 

Guillaume E.B. commented on SPARK-4105:
---

I think I hit this bug while using another compression codec. I will try to reproduce 
it as soon as I can.
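For reference, a sketch of switching the block compression codec via configuration (the codec names are the standard Spark ones; whether a different codec actually avoids the corruption is exactly what needs verifying):
{code}
import org.apache.spark.SparkConf

// Re-run the same job with a different compression codec to check whether
// FAILED_TO_UNCOMPRESS is specific to snappy.
val conf = new SparkConf()
  .setAppName("codec-check")
  .set("spark.io.compression.codec", "lzf") // alternatives: "snappy" (default), "lz4"
{code}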

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Attachments: JavaObjectToSerialize.java, 
 SparkFailedToUncompressGenerator.scala


 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 Here's another occurrence of a similar error:
 

[jira] [Commented] (SPARK-5220) keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545456#comment-14545456
 ] 

Sean Owen commented on SPARK-5220:
--

[~superxma] is this resolved then?

 keepPushingBlocks in BlockGenerator terminated when an exception occurs, 
 which causes the block pushing thread to terminate and blocks receiver  
 -

 Key: SPARK-5220
 URL: https://issues.apache.org/jira/browse/SPARK-5220
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Max Xu

 I am running a Spark streaming application with ReliableKafkaReceiver. It 
 uses BlockGenerator to push blocks to BlockManager. However, writing WALs to 
 HDFS may time out, which causes keepPushingBlocks in BlockGenerator to 
 terminate.
 15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at 
 org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:176)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:160)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:126)
 at 
 org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124)
 at 
 org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:207)
 at 
 org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:275)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:181)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:154)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:86)
 The block pushing thread then terminates and no subsequent blocks can be pushed 
 into BlockManager, which in turn prevents the receiver from receiving new data. 
 So when the TimeoutException happens while my app is running, the 
 ReliableKafkaReceiver stays in ACTIVE status but doesn't do anything at all; 
 the application hangs.
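One possible mitigation, sketched here as an illustration rather than the project's actual fix, is to catch failures inside the push loop and retry instead of letting the pushing thread die:
{code}
import scala.util.control.NonFatal

object BlockPushRetry {
  // Illustrative retry wrapper; pushBlock stands in for the real block-push call.
  def pushWithRetry(pushBlock: () => Unit, maxAttempts: Int = 3): Unit = {
    var attempt = 0
    var done = false
    while (!done && attempt < maxAttempts) {
      attempt += 1
      try { pushBlock(); done = true }
      catch { case NonFatal(e) =>
        // Log and retry rather than terminating the block pushing thread outright.
        System.err.println(s"Block push failed (attempt $attempt): $e")
      }
    }
    if (!done) throw new RuntimeException(s"Giving up after $maxAttempts attempts")
  }
}
{code}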






[jira] [Assigned] (SPARK-5175) bug in updating counters when starting multiple workers/supervisors in actor-based receiver

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5175:
---

Assignee: (was: Apache Spark)

 bug in updating counters when starting multiple workers/supervisors in 
 actor-based receiver
 ---

 Key: SPARK-5175
 URL: https://issues.apache.org/jira/browse/SPARK-5175
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Nan Zhu

 When starting multiple workers (ActorReceiver.scala), the counters in it are 
 not updated.






[jira] [Assigned] (SPARK-5174) Missing Document for starting multiple workers/supervisors in actor-based receiver

2015-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5174:
---

Assignee: (was: Apache Spark)

 Missing Document for starting multiple workers/supervisors in actor-based 
 receiver
 --

 Key: SPARK-5174
 URL: https://issues.apache.org/jira/browse/SPARK-5174
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Nan Zhu
Priority: Minor

 Currently, the documentation for starting multiple supervisors/workers is 
 missing, though the implementation provides this capability:
 {code:title=ActorReceiver.scala|borderStyle=solid}
 case props: Props =>
   val worker = context.actorOf(props)
   logInfo("Started receiver worker at:" + worker.path)
   sender ! worker

 case (props: Props, name: String) =>
   val worker = context.actorOf(props, name)
   logInfo("Started receiver worker at:" + worker.path)
   sender ! worker

 case _: PossiblyHarmful => hiccups.incrementAndGet()

 case _: Statistics =>
   val workers = context.children
   sender ! Statistics(n.get, workers.size, hiccups.get, workers.mkString("\n"))
 {code}
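A hedged usage sketch of the capability shown above (the `supervisor` reference and the worker actor class are illustrative assumptions; pinning down the real entry point is exactly what the missing documentation should do):
{code}
import akka.actor.{Actor, Props}

// Hypothetical user-defined worker actor that would feed data to the receiver.
class MyWorker extends Actor {
  def receive = { case _ => () }
}

// From code that holds a reference to the receiver's supervisor actor:
//   supervisor ! Props[MyWorker]                // start an anonymous worker
//   supervisor ! (Props[MyWorker], "worker-1")  // start a named worker
{code}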






[jira] [Updated] (SPARK-7536) Audit MLlib Python API for 1.4

2015-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-7536:
---
Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release. SPARK-7665
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python.
** classification
*** StreamingLogisticRegressionWithSGD SPARK-7633
** clustering
*** GaussianMixture SPARK-6258
*** LDA SPARK-6259
*** Power Iteration Clustering SPARK-5962
*** StreamingKMeans SPARK-4118 
** evaluation
*** MultilabelMetrics SPARK-6094 
** feature
*** ElementwiseProduct SPARK-7605
*** PCA SPARK-7604
** linalg
*** Distributed linear algebra SPARK-6100
** pmml.export SPARK-7638
** regression
*** StreamingLinearRegressionWithSGD SPARK-4127
** stat
*** KernelDensity SPARK-7639
** util
*** MLUtils SPARK-6263 


 Audit MLlib Python API for 1.4
 --

 Key: SPARK-7536
 URL: https://issues.apache.org/jira/browse/SPARK-7536
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 For new public APIs added to MLlib, we need to check the generated HTML doc 
 and compare the Scala & Python versions.  We need to track:
 * Inconsistency: Do class/method/parameter names match?
 * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
 be as complete as the Scala doc.
 * API breaking changes: These should be very rare but are occasionally either 
 necessary (intentional) or accidental.  These must be recorded and added in 
 the Migration Guide for this release. SPARK-7665
 ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
 component, please note that as well.
 * Missing classes/methods/parameters: We should create to-do JIRAs for 
 functionality missing from Python.
 ** classification
 *** StreamingLogisticRegressionWithSGD SPARK-7633
 ** clustering
 *** GaussianMixture SPARK-6258
 *** LDA SPARK-6259
 *** Power Iteration Clustering SPARK-5962
 *** StreamingKMeans SPARK-4118 
 ** evaluation
 *** MultilabelMetrics SPARK-6094 
 ** feature
 *** ElementwiseProduct SPARK-7605
 *** PCA SPARK-7604
 ** linalg
 *** Distributed linear algebra SPARK-6100
 ** pmml.export SPARK-7638
 ** regression
 *** StreamingLinearRegressionWithSGD SPARK-4127
 ** stat
 *** KernelDensity SPARK-7639
 ** util
 *** MLUtils SPARK-6263 






[jira] [Created] (SPARK-7667) MLlib Python API consistency check

2015-05-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-7667:
--

 Summary: MLlib Python API consistency check
 Key: SPARK-7667
 URL: https://issues.apache.org/jira/browse/SPARK-7667
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang


Check and ensure that the MLlib Python API (classes/methods/parameters) is 
consistent with Scala.






[jira] [Updated] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4598:
-
Issue Type: Improvement  (was: Bug)

 Paginate stage page to avoid OOM with > 100,000 tasks
 -

 Key: SPARK-4598
 URL: https://issues.apache.org/jira/browse/SPARK-4598
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.2.0
Reporter: meiyoula

 On the HistoryServer stage page, clicking the task link in the Description 
 column triggers a GC error. The detailed error message is:
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-352] | Error for 
 /history/application_1416206401491_0010/stages/stage/ | 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-364] | handle failed | 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
 java.lang.OutOfMemoryError: GC overhead limit exceeded






[jira] [Updated] (SPARK-1910) Add onBlockComplete API to receiver

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1910:
-
Issue Type: Improvement  (was: Bug)

 Add onBlockComplete API to receiver
 ---

 Key: SPARK-1910
 URL: https://issues.apache.org/jira/browse/SPARK-1910
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Reporter: Hari Shreedharan

 This would allow the receiver to ACK all data that has already been 
 successfully stored by the block generator. It means the receiver's store 
 methods must now receive the block id, so the receiver can recognize which 
 events have been stored.
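A rough sketch of what such a callback could look like (names and signatures here are hypothetical and only illustrate the proposal; they are not an existing Spark API):
{code}
// Hypothetical extension of the receiver API described above.
trait BlockCompletionListener {
  /** Called once the block generator has durably stored the block with this id. */
  def onBlockComplete(blockId: String): Unit
}

class AckingReceiverPart extends BlockCompletionListener {
  def onBlockComplete(blockId: String): Unit = {
    // ACK upstream (e.g. to Flume or Kafka) only the events that belong to blockId.
  }
}
{code}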






[jira] [Updated] (SPARK-1107) Add shutdown hook on executor stop to stop running tasks

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1107:
-
Issue Type: Improvement  (was: Bug)

We have a shutdown hook that stops the SparkContext, which is kind of related.

 Add shutdown hook on executor stop to stop running tasks
 

 Key: SPARK-1107
 URL: https://issues.apache.org/jira/browse/SPARK-1107
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Andrew Ash

 Originally reported by aash:
 http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA%2B-p3AHXYhpjXH9fr8jQ5%2B_gc%3DNHjLbOiJB9bHSahfEET5aHBQ%40mail.gmail.com%3E
 Latest in thread:
 http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201402.mbox/%3CCA+-p3AFi7vz=2oty3caa0g+5ekg+a84uvqrl9tgstvgwgyb...@mail.gmail.com%3E
 The most popular approach is to add a shutdown hook that stops running tasks 
 in the executors.
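A minimal sketch of the JVM-level mechanism involved; stopRunningTasks() is a placeholder rather than an existing Spark method, and deciding what it should actually do is the open design question:
{code}
object ExecutorShutdown {
  // Placeholder for whatever "stop running tasks" should do, e.g. interrupt
  // task threads and give them a moment to clean up.
  def stopRunningTasks(): Unit = ()

  // Register a JVM shutdown hook that runs the placeholder on executor stop.
  def install(): Unit =
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = stopRunningTasks()
    }))
}
{code}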






[jira] [Resolved] (SPARK-604) reconnect if mesos slaves dies

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-604.
-
Resolution: Cannot Reproduce

Stale at this point, without similar findings recently.

 reconnect if mesos slaves dies
 --

 Key: SPARK-604
 URL: https://issues.apache.org/jira/browse/SPARK-604
 Project: Spark
  Issue Type: Bug
  Components: Mesos

 when running on mesos, if a slave goes down, spark doesn't try to reassign 
 the work to another machine.  Even if the slave comes back up, the job is 
 doomed.
 Currently when this happens, we just see this in the driver logs:
 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: Mesos slave lost: 
 201210312057-1560611338-5050-24091-52
 Exception in thread Thread-346 java.util.NoSuchElementException: key not 
 found: value: 201210312057-1560611338-5050-24091-52
 at scala.collection.MapLike$class.default(MapLike.scala:224)
 at scala.collection.mutable.HashMap.default(HashMap.scala:43)
 at scala.collection.MapLike$class.apply(MapLike.scala:135)
 at scala.collection.mutable.HashMap.apply(HashMap.scala:43)
 at 
 spark.scheduler.cluster.ClusterScheduler.slaveLost(ClusterScheduler.scala:255)
 at 
 spark.scheduler.mesos.MesosSchedulerBackend.slaveLost(MesosSchedulerBackend.scala:275)
 12/11/01 16:48:56 INFO mesos.MesosSchedulerBackend: driver.run() returned 
 with code DRIVER_ABORTED
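The immediate crash in the log above is a HashMap.apply on a slave id that is no longer (or was never) registered. A defensive lookup of the kind sketched here, which is only an illustration and not the project's actual change, would at least keep the scheduler thread alive so the work could be reassigned:
{code}
object SlaveLostHandling {
  private val slaveIdToExecutor = scala.collection.mutable.HashMap[String, String]()

  // Illustrative only: use get() instead of apply() so an unknown slave id is
  // logged and ignored rather than throwing NoSuchElementException.
  def slaveLost(slaveId: String): Unit = slaveIdToExecutor.get(slaveId) match {
    case Some(executorId) =>
      // ... clean up state for executorId and reassign its tasks ...
      slaveIdToExecutor -= slaveId
    case None =>
      println(s"Ignoring lost Mesos slave $slaveId: no executor registered")
  }
}
{code}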






[jira] [Commented] (SPARK-5331) Spark workers can't find tachyon master as spark-ec2 doesn't set spark.tachyonStore.url

2015-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14545441#comment-14545441
 ] 

Sean Owen commented on SPARK-5331:
--

[~florianverhein] is this an issue then or just a matter of setting the config 
correctly?
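For reference, a sketch of what setting the config explicitly would look like, assuming the Spark 1.2 spark.tachyonStore.url property and the Tachyon master hostname quoted in the report below:
{code}
import org.apache.spark.SparkConf

// Point executors at the real Tachyon master instead of the default
// tachyon://localhost:19998, which is what the failing workers try below.
val conf = new SparkConf()
  .set("spark.tachyonStore.url",
       "tachyon://ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com:19998")
{code}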

 Spark workers can't find tachyon master as spark-ec2 doesn't set 
 spark.tachyonStore.url
 ---

 Key: SPARK-5331
 URL: https://issues.apache.org/jira/browse/SPARK-5331
 Project: Spark
  Issue Type: Bug
  Components: EC2
 Environment: Running on EC2 via modified spark-ec2 scripts (to get 
 dependencies right so tachyon starts)
 Using tachyon 0.5.0 built against hadoop 2.4.1
 Spark 1.2.0 built against tachyon 0.5.0 and hadoop 2.4.1
 Tachyon configured using the template in 0.5.0 but updated with slave list 
 and master variables etc..
Reporter: Florian Verhein

 ps -ef | grep Tachyon 
 shows Tachyon running on the master (and the slave) node with the correct setting:
 -Dtachyon.master.hostname=ec2-54-252-156-187.ap-southeast-2.compute.amazonaws.com
 However from stderr log on worker running the SparkTachyonPi example:
 15/01/20 06:00:56 INFO CacheManager: Partition rdd_0_0 not found, computing it
 15/01/20 06:00:56 INFO : Trying to connect master @ localhost/127.0.0.1:19998
 15/01/20 06:00:56 ERROR : Failed to connect (1) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:00:57 ERROR : Failed to connect (2) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:00:58 ERROR : Failed to connect (3) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:00:59 ERROR : Failed to connect (4) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:01:00 ERROR : Failed to connect (5) to master 
 localhost/127.0.0.1:19998 : java.net.ConnectException: Connection refused
 15/01/20 06:01:01 WARN TachyonBlockManager: Attempt 1 to create tachyon dir 
 null failed
 java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 
 after 5 attempts
   at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
   at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
   at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
   at 
 org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
   at 
 org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
   at 
 org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57)
   at 
 org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:94)
   at 
 org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:88)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:773)
   at 
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
   at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master 
 localhost/127.0.0.1:19998 after 5 attempts
   at tachyon.master.MasterClient.connect(MasterClient.java:178)
   at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
   ... 28 more
 Caused by: 

[jira] [Resolved] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5246.
--
Resolution: Done
  Assignee: Vladimir Grigor

(This was really fixed by a PR for Mesos.)

 spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does 
 not resolve
 --

 Key: SPARK-5246
 URL: https://issues.apache.org/jira/browse/SPARK-5246
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor
Assignee: Vladimir Grigor

 How to reproduce: 
 1) Following http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
 should be sufficient to set up a VPC for this bug. After you have followed that 
 guide, start a new instance in the VPC and ssh to it (through the NAT server).
 2) user starts a cluster in VPC:
 {code}
 ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
 --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
 Setting up security groups...
 
 (omitted for brevity)
 10.1.1.62
 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
 no org.apache.spark.deploy.master.Master to stop
 starting org.apache.spark.deploy.master.Master, logging to 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 failed to launch org.apache.spark.deploy.master.Master:
   at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
   ... 12 more
 full log in 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
 10.1.1.62:at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
 10.1.1.62:... 12 more
 10.1.1.62: full log in 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
 [timing] spark-standalone setup:  00h 00m 28s
  
 (omitted for brevity)
 {code}
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 {code}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
 :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
  -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
 org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 
 --webui-port 8080
 
 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
 HUP, INT]
 Exception in thread main java.net.UnknownHostException: ip-10-1-1-151: 
 ip-10-1-1-151: Name or service not known
 at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
 at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
 at 
 org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
 at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
 at 
 org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
 at 
 org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
 at 
 org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
 at 
 org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
 at 
 org.apache.spark.deploy.master.MasterArguments.init(MasterArguments.scala:27)
 at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
 at org.apache.spark.deploy.master.Master.main(Master.scala)
 Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
 known
 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
 at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
 at 
 java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
 at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
 ... 12 more
 {code}
 The problem is that an instance launched in a VPC may not be able to resolve 
 its own local hostname. Please see 
 https://forums.aws.amazon.com/thread.jspa?threadID=92092.
 I am going to submit a fix for this problem since I need this functionality 
 asap.





[jira] [Resolved] (SPARK-3942) LogisticRegressionWithLBFGS should not use SquaredL2Updater

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3942.
--
Resolution: Won't Fix

 LogisticRegressionWithLBFGS should not use SquaredL2Updater 
 

 Key: SPARK-3942
 URL: https://issues.apache.org/jira/browse/SPARK-3942
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: fuminglin

 The LBFGS method uses a line search for the step size, but all of MLlib's 
 updaters use a step size that decreases with the square root of the number of 
 iterations, which may cause the Wolfe conditions not to hold.
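For reference, the Wolfe conditions a line search is expected to satisfy, written out in LaTeX-style notation (f is the objective, p_k the search direction, alpha_k the step size, and 0 < c_1 < c_2 < 1):
{code}
f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f(x_k)^{\top} p_k    % sufficient decrease
\nabla f(x_k + \alpha_k p_k)^{\top} p_k \ge c_2 \nabla f(x_k)^{\top} p_k    % curvature
{code}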






[jira] [Resolved] (SPARK-3967) Spark applications fail in yarn-cluster mode when the directories configured in yarn.nodemanager.local-dirs are located on different disks/partitions

2015-05-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3967.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Christophe Préaud

 Spark applications fail in yarn-cluster mode when the directories configured 
 in yarn.nodemanager.local-dirs are located on different disks/partitions
 -

 Key: SPARK-3967
 URL: https://issues.apache.org/jira/browse/SPARK-3967
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Christophe Préaud
Assignee: Christophe Préaud
 Fix For: 1.2.0

 Attachments: spark-1.1.0-utils-fetch.patch, 
 spark-1.1.0-yarn_cluster_tmpdir.patch


 Spark applications fail from time to time in yarn-cluster mode (but not in 
 yarn-client mode) when yarn.nodemanager.local-dirs (Hadoop YARN config) is 
 set to a comma-separated list of directories which are located on different 
 disks/partitions.
 Steps to reproduce:
 1. Set yarn.nodemanager.local-dirs (in yarn-site.xml) to a list of 
 directories located on different partitions (the more you set, the more 
 likely it will be to reproduce the bug):
 (...)
 <property>
   <name>yarn.nodemanager.local-dirs</name>
   <value>file:/d1/yarn/local/nm-local-dir,file:/d2/yarn/local/nm-local-dir,file:/d3/yarn/local/nm-local-dir,file:/d4/yarn/local/nm-local-dir,file:/d5/yarn/local/nm-local-dir,file:/d6/yarn/local/nm-local-dir,file:/d7/yarn/local/nm-local-dir</value>
 </property>
 (...)
 2. Launch an application in yarn-cluster mode several times; it will fail 
 (apparently randomly) from time to time.





