[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436335#comment-15436335
 ] 

DB Tsai commented on SPARK-17163:
-

I voted for merging into one interface as well. Since binary LOR can be 
represented as a matrix just like MLOR, we can always return a coefficient 
matrix and an intercept vector for both BLOR and MLOR. For BLOR, flattening 
the matrix and setting the intercept to zero feels too hacky; we could just 
throw an exception instead. Finally, we can default two-class problems to 
pivoting, and use MLOR without pivoting when there are more than two classes. 
I like what Yanbo suggested: default to auto, and let users explicitly choose 
binomial or multinomial. Thanks. 
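
As a rough, user-facing sketch of what such a unified estimator could look like (the {{family}} param name, the matrix-valued accessors, and the {{training}} DataFrame are assumptions here, not a settled API):

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical unified interface: one estimator with a "family" param defaulting to "auto".
val lr = new LogisticRegression()
  .setFamily("auto")   // "auto": binomial for 2 classes, multinomial otherwise
  .setMaxIter(100)
  .setRegParam(0.1)

// `training` is assumed to be a DataFrame with "label" and "features" columns.
val model = lr.fit(training)

// Coefficients exposed uniformly as a matrix plus an intercept vector,
// so BLOR and MLOR models share one representation.
println(model.coefficientMatrix)   // numClasses x numFeatures (1 x numFeatures for binomial)
println(model.interceptVector)
{code}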


> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous, since MLOR can do basically all of what BLOR does. We 
> should decide whether this needs to change and implement those changes before 2.1.






[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436326#comment-15436326
 ] 

Seth Hendrickson commented on SPARK-17163:
--

Good catch, thanks!

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous, since MLOR can do basically all of what BLOR does. We 
> should decide whether this needs to change and implement those changes before 2.1.






[jira] [Commented] (SPARK-17201) Investigate numerical instability for MLOR without regularization

2016-08-24 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436318#comment-15436318
 ] 

DB Tsai commented on SPARK-17201:
-

This makes sense. Let's keep an eye on this and figure out the interface 
first. Patching this with pivoting is relatively easy, and it can be done 
without changing the model format by unpivoting the coefficients and 
centering them again. Thanks.
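
For reference, a minimal sketch (plain Scala arrays, not the actual MLlib code path) of the unpivot-and-center step described above: the pivoted class's coefficients are restored as zeros and then each feature's coefficients are re-centered across classes, which leaves the softmax predictions unchanged.

{code}
// Sketch only: turn a (K-1) x F coefficient matrix fit with class K pivoted to zero
// back into the full K x F matrix the model format expects, then re-center it.
def unpivotAndCenter(pivoted: Array[Array[Double]]): Array[Array[Double]] = {
  val numFeatures = pivoted.head.length
  // Restore the pivot class as an all-zero coefficient row.
  val full = pivoted :+ Array.fill(numFeatures)(0.0)
  val k = full.length
  // Per-feature mean across classes; subtracting it from every class cancels in the
  // softmax normalization, so predictions are unchanged.
  val means = Array.tabulate(numFeatures)(j => full.map(_(j)).sum / k)
  full.map(row => row.indices.map(j => row(j) - means(j)).toArray)
}
{code}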

> Investigate numerical instability for MLOR without regularization
> -
>
> Key: SPARK-17201
> URL: https://issues.apache.org/jira/browse/SPARK-17201
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> As mentioned 
> [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no 
> regularization is applied in Softmax regression, second order Newton solvers 
> may run into numerical instability problems. We should investigate this in 
> practice and find a solution, possibly by implementing pivoting when no 
> regularization is applied.






[jira] [Issue Comment Deleted] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name

2016-08-24 Thread Jagadeesan A S (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagadeesan A S updated SPARK-17232:
---
Comment: was deleted

(was: I'm not able to reproduce the issue. 

{code:xml}
scala> readDf.rdd
res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] 
at rdd at :26

scala> readDf.show()
+---+---+
|a.b|a.c|
+---+---+
|  1|  2|
+---+---+

{code}
)

> Expecting same behavior after loading a dataframe with dots in column name
> --
>
> Key: SPARK-17232
> URL: https://issues.apache.org/jira/browse/SPARK-17232
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Louis Salin
>
> In Spark 2.0, the behavior of a dataframe changes after saving and reloading 
> it when there are dots in the column names. In the example below, I was able 
> to call the {{rdd}} function for a newly created dataframe. However, after 
> saving it and reloading it, an exception gets thrown when calling the {{rdd}} 
> function.
> from a spark-shell:
> {{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}}
> Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
> {{scala> simpleDf.rdd}}
> Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = 
> MapPartitionsRDD\[7\] at rdd at :29
> {{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}}
> {{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}}
> Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
> {{scala> readDf.rdd}}
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, 
> a.c];
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at 
> 

[jira] [Commented] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name

2016-08-24 Thread Jagadeesan A S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436289#comment-15436289
 ] 

Jagadeesan A S commented on SPARK-17232:


I'm not able to reproduce the issue. 

{code:xml}
scala> readDf.rdd
res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] 
at rdd at :26

scala> readDf.show()
+---+---+
|a.b|a.c|
+---+---+
|  1|  2|
+---+---+

{code}
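
For anyone hitting the {{AnalysisException}} on an affected build, the snippet below shows the usual ways of handling dotted column names: back-ticks when referencing them, or renaming them before dropping to the RDD API. This is a workaround sketch only and may or may not sidestep this particular planner bug (column names are taken from the report above).

{code}
// Back-ticks keep the dot from being read as a struct-field accessor.
readDf.select("`a.b`", "`a.c`").show()

// Renaming the columns avoids dots entirely before calling .rdd.
val safeDf = readDf.toDF("a_b", "a_c")
safeDf.rdd
{code}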


> Expecting same behavior after loading a dataframe with dots in column name
> --
>
> Key: SPARK-17232
> URL: https://issues.apache.org/jira/browse/SPARK-17232
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Louis Salin
>
> In Spark 2.0, the behavior of a dataframe changes after saving and reloading 
> it when there are dots in the column names. In the example below, I was able 
> to call the {{rdd}} function for a newly created dataframe. However, after 
> saving it and reloading it, an exception gets thrown when calling the {{rdd}} 
> function.
> from a spark-shell:
> {{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}}
> Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
> {{scala> simpleDf.rdd}}
> Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = 
> MapPartitionsRDD\[7\] at rdd at :29
> {{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}}
> {{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}}
> Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
> {{scala> readDf.rdd}}
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, 
> a.c];
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at 
> 

[jira] [Comment Edited] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name

2016-08-24 Thread Jagadeesan A S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436285#comment-15436285
 ] 

Jagadeesan A S edited comment on SPARK-17232 at 8/25/16 5:01 AM:
-

I'm not able to reproduce the issue. 

{code:xml}
scala> readDf.rdd
res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] 
at rdd at :26

scala> readDf.show()
+---+---+
|a.b|a.c|
+---+---+
|  1|  2|
+---+---+

{code}



was (Author: as2):
I'm not able to reproduce the issue. 

{code:xml}
scala> readDf.rdd
res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] 
at rdd at :26
{code}


> Expecting same behavior after loading a dataframe with dots in column name
> --
>
> Key: SPARK-17232
> URL: https://issues.apache.org/jira/browse/SPARK-17232
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Louis Salin
>
> In Spark 2.0, the behavior of a dataframe changes after saving and reloading 
> it when there are dots in the column names. In the example below, I was able 
> to call the {{rdd}} function for a newly created dataframe. However, after 
> saving it and reloading it, an exception gets thrown when calling the {{rdd}} 
> function.
> from a spark-shell:
> {{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}}
> Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
> {{scala> simpleDf.rdd}}
> Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = 
> MapPartitionsRDD\[7\] at rdd at :29
> {{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}}
> {{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}}
> Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
> {{scala> readDf.rdd}}
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, 
> a.c];
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)

[jira] [Commented] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name

2016-08-24 Thread Jagadeesan A S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436285#comment-15436285
 ] 

Jagadeesan A S commented on SPARK-17232:


I'm not able to reproduce the issue. 

{code:xml}
scala> readDf.rdd
res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] 
at rdd at :26
{code}


> Expecting same behavior after loading a dataframe with dots in column name
> --
>
> Key: SPARK-17232
> URL: https://issues.apache.org/jira/browse/SPARK-17232
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Louis Salin
>
> In Spark 2.0, the behavior of a dataframe changes after saving and reloading 
> it when there are dots in the column names. In the example below, I was able 
> to call the {{rdd}} function for a newly created dataframe. However, after 
> saving it and reloading it, an exception gets thrown when calling the {{rdd}} 
> function.
> from a spark-shell:
> {{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}}
> Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
> {{scala> simpleDf.rdd}}
> Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = 
> MapPartitionsRDD\[7\] at rdd at :29
> {{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}}
> {{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}}
> Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
> {{scala> readDf.rdd}}
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, 
> a.c];
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at 
> org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at 
> org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at 

[jira] [Updated] (SPARK-17190) Removal of HiveSharedState

2016-08-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17190:

Assignee: Xiao Li

> Removal of HiveSharedState
> --
>
> Key: SPARK-17190
> URL: https://issues.apache.org/jira/browse/SPARK-17190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Since `HiveClient` is used to interact with the Hive metastore, it should be 
> hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
> `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
> `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
> straightforward. After removal of `HiveSharedState`, the reflection logic is 
> applied directly to the choice of `ExternalCatalog` type, based on the 
> configuration of `CATALOG_IMPLEMENTATION`.
> Because `HiveClient` is also used/invoked by entities other than 
> `HiveExternalCatalog`, we define the following two APIs:
> {noformat}
>   /**
>* Return the existing [[HiveClient]] used to interact with the metastore.
>*/
>   def getClient: HiveClient
>   /**
>* Return a [[HiveClient]] as a new session
>*/
>   def getNewClient: HiveClient
> {noformat}
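
As a rough illustration of the reflection-based selection mentioned above (a sketch, not the actual {{SharedState}} code; the concrete catalogs' constructor arguments are internal and version-specific, so instantiation is kept generic):

{code}
// Map the CATALOG_IMPLEMENTATION setting ("hive" or "in-memory") to a class name.
def externalCatalogClassName(catalogImplementation: String): String =
  catalogImplementation match {
    case "hive" => "org.apache.spark.sql.hive.HiveExternalCatalog"
    case _      => "org.apache.spark.sql.catalyst.catalog.InMemoryCatalog"
  }

// Generic reflective instantiation; assumes the first public constructor accepts `args`.
def reflect[T](className: String, args: AnyRef*): T =
  Class.forName(className).getConstructors.head.newInstance(args: _*).asInstanceOf[T]
{code}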






[jira] [Resolved] (SPARK-17190) Removal of HiveSharedState

2016-08-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17190.
-

Issue resolved by pull request 14757
[https://github.com/apache/spark/pull/14757]

> Removal of HiveSharedState
> --
>
> Key: SPARK-17190
> URL: https://issues.apache.org/jira/browse/SPARK-17190
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.1.0
>
>
> Since `HiveClient` is used to interact with the Hive metastore, it should be 
> hidden in `HiveExternalCatalog`. After moving `HiveClient` into 
> `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of 
> `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes 
> straightforward. After removal of `HiveSharedState`, the reflection logic is 
> applied directly to the choice of `ExternalCatalog` type, based on the 
> configuration of `CATALOG_IMPLEMENTATION`.
> Because `HiveClient` is also used/invoked by entities other than 
> `HiveExternalCatalog`, we define the following two APIs:
> {noformat}
>   /**
>* Return the existing [[HiveClient]] used to interact with the metastore.
>*/
>   def getClient: HiveClient
>   /**
>* Return a [[HiveClient]] as a new session
>*/
>   def getNewClient: HiveClient
> {noformat}






[jira] [Updated] (SPARK-14381) Review spark.ml parity for feature transformers

2016-08-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14381:

Fix Version/s: (was: 2.1.0)

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436268#comment-15436268
 ] 

Yanbo Liang commented on SPARK-14381:
-

Resolved this, thanks for working on it.

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Resolved] (SPARK-14381) Review spark.ml parity for feature transformers

2016-08-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-14381.
-
   Resolution: Done
 Assignee: Xusen Yin
Fix Version/s: 2.1.0

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
> Fix For: 2.1.0
>
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:30 AM:
--

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** PMML SPARK-11239
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413
* StreamingLinearRegression


was (Author: yanboliang):
* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** PMML SPARK-11237
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413
* StreamingLinearRegression

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:29 AM:
--

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** PMML SPARK-11237
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413
* StreamingLinearRegression


was (Author: yanboliang):
* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:26 AM:
--

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary
* IsotonicRegressionModel
** single-row prediction SPARK-10413


was (Author: yanboliang):
* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary SPARK-14712
* IsotonicRegressionModel
** single-row prediction SPARK-10413

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Assigned] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-14378:
---

Assignee: Yanbo Liang

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:25 AM:
--

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel (Hold off until SPARK-10780 resolved)
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary SPARK-14712
* IsotonicRegressionModel
** single-row prediction SPARK-10413


was (Author: yanboliang):
* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel 
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary SPARK-14712
* IsotonicRegressionModel
** single-row prediction SPARK-10413

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Resolved] (SPARK-17228) Not infer/propagate non-deterministic constraints

2016-08-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17228.
-
   Resolution: Fixed
 Assignee: Sameer Agarwal
Fix Version/s: 2.1.0
   2.0.1

> Not infer/propagate non-deterministic constraints
> -
>
> Key: SPARK-17228
> URL: https://issues.apache.org/jira/browse/SPARK-17228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.1, 2.1.0
>
>







[jira] [Commented] (SPARK-14378) Review spark.ml parity for regression, except trees

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264
 ] 

Yanbo Liang commented on SPARK-14378:
-

* 
GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel)
** single-row prediction SPARK-10413
** initialModel 
** setValidateData (not important for public API)
** optimizer interface SPARK-17136
** LBFGS set numCorrections (not important for public API)
** toString: print summary SPARK-14712
* IsotonicRegressionModel
** single-row prediction SPARK-10413

> Review spark.ml parity for regression, except trees
> ---
>
> Key: SPARK-14378
> URL: https://issues.apache.org/jira/browse/SPARK-14378
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Updated] (SPARK-17066) dateFormat should be used when writing dataframes as csv files

2016-08-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17066:

Fix Version/s: 2.1.0
   2.0.1

> dateFormat should be used when writing dataframes as csv files
> --
>
> Key: SPARK-17066
> URL: https://issues.apache.org/jira/browse/SPARK-17066
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Barry Becker
> Fix For: 2.0.1, 2.1.0
>
>
> I noticed this when running tests after pulling and building @lw-lin's PR 
> (https://github.com/apache/spark/pull/14118). I don't think there is anything 
> wrong with his PR; the fix that was made to spark-csv for this issue was 
> simply never ported to Spark 2.x when Databricks' spark-csv was merged into 
> Spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 
> was fixed in spark-csv after that merge.
> The problem is that if I try to write a dataframe that contains a date column 
> out to a CSV file using something like this:
> repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
> .option("delimiter", "\t")
> .option("header", "false")
> .option("nullValue", "?")
> .option("dateFormat", "-MM-dd'T'HH:mm:ss")
> .option("escape", "\\")   
> .save(tempFileName)
> Then my unit test (which passed under spark 1.6.2) fails using the spark 
> 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a 
> date column.
> Expected "[2012-01-03T09:12:00
> ?
> 2015-02-23T18:00:]00", 
> but got 
> "[132561072000
> ?
> 1424743200]00"
> This means that while the null value is being exported correctly, the 
> specified dateFormat is not being used to format the date. Instead, it looks 
> like the number of seconds since the epoch is being used.
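
For what it's worth, a minimal round-trip sketch of the behaviour the option is expected to provide once the writer honours {{dateFormat}} (the path and the DataFrame {{df}} are illustrative only):

{code}
val out = "/tmp/dates_csv"   // illustrative path
df.write
  .option("header", "true")
  .option("nullValue", "?")
  .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")  // should format date/timestamp columns on write
  .csv(out)

spark.read.option("header", "true").csv(out).show(false)
// Expected: formatted values such as 2012-01-03T09:12:00, with nulls written as "?".
// Observed on the affected build: epoch-based numbers instead.
{code}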






[jira] [Updated] (SPARK-16597) DataFrame DateType is written as an int(Days since epoch) by csv writer

2016-08-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16597:

Fix Version/s: 2.1.0
   2.0.1

> DataFrame DateType is written as an int(Days since epoch) by csv writer
> ---
>
> Key: SPARK-16597
> URL: https://issues.apache.org/jira/browse/SPARK-16597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dean Chen
>  Labels: csv
> Fix For: 2.0.1, 2.1.0
>
>
> import java.sql.Date
> case class DateClass(date: java.sql.Date)
> val df = spark.createDataFrame(Seq(DateClass(new Date(1468774636000L))))
> df.write.csv("test.csv")
> file content is 16999, days since epoch instead of 7/17/16
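
Until the writer formats {{DateType}} properly, one possible workaround (a sketch; the column name matches the repro above) is to format the column explicitly before writing:

{code}
import org.apache.spark.sql.functions.{col, date_format}

df.select(date_format(col("date"), "MM/dd/yy").as("date"))
  .write
  .csv("test_formatted.csv")
{code}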






[jira] [Resolved] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-08-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16216.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> CSV data source does not write date and timestamp correctly
> ---
>
> Key: SPARK-16216
> URL: https://issues.apache.org/jira/browse/SPARK-16216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: releasenotes
> Fix For: 2.0.1, 2.1.0
>
>
> Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below:
> {code}
> ++
> |date|
> ++
> |14406372|
> |14144598|
> |14540400|
> ++
> {code}
> It would be nicer if it wrote dates and timestamps as formatted strings, just 
> like the JSON data source does.
> Also, the CSV data source currently supports a {{dateFormat}} option for reading 
> dates and timestamps in a custom format. It would be better if this option could 
> be applied when writing as well.






[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17204:
---
Description: 
We use the OFF_HEAP storage level extensively with great success. We've tried 
off-heap storage with replication factor 2 and have always received exceptions 
on the executor side very shortly after starting the job. For example:

{code}
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

or

{code}
java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436222#comment-15436222
 ] 

Saisai Shao commented on SPARK-17204:
-

Yes, I can reproduce this issue, but not consistently; sometimes it runs fine 
without any exception.
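
For reference, a repro sketch along the lines discussed in this thread (assumes a spark-shell with {{spark.memory.offHeap.enabled=true}} and an off-heap size configured; the storage level is the off-heap level with replication 2):

{code}
import org.apache.spark.storage.StorageLevel

// Off-heap, serialized, replicated to 2 executors.
val OFF_HEAP_2 = StorageLevel(useDisk = false, useMemory = true,
  useOffHeap = true, deserialized = false, replication = 2)

val rdd = sc.range(0, 1000000).persist(OFF_HEAP_2)
rdd.count()   // on affected builds this can fail with Kryo deserialization errors
{code}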

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively with great success. We've tried 
> off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> 

[jira] [Comment Edited] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436218#comment-15436218
 ] 

Michael Allman edited comment on SPARK-17204 at 8/25/16 3:36 AM:
-

I would think that, but `sc.range(0, 0)` throws the exception, too. Are you 
able to reproduce the problem with `sc.range(0, 2)`?


was (Author: michael):
I would think that, but `sc.range(0, 0)` throws the exception, too.

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively with great success. We've tried 
> off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>  

[jira] [Comment Edited] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436218#comment-15436218
 ] 

Michael Allman edited comment on SPARK-17204 at 8/25/16 3:37 AM:
-

I would think that, but {{sc.range(0, 0)}} throws the exception, too. Are you 
able to reproduce the problem with {{sc.range(0, 2)}}?


was (Author: michael):
I would think that, but `sc.range(0, 0)` throws the exception, too. Are you 
able to reproduce the problem with `sc.range(0, 2)`?

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively with great success. We've tried 
> off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> 

[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436218#comment-15436218
 ] 

Michael Allman commented on SPARK-17204:


I would think that, but `sc.range(0, 0)` throws the exception, too.

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively with great success. We've tried 
> off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> 

[jira] [Updated] (SPARK-17233) Shuffle file will be left over the capacity when dynamic schedule is enabled in a long running case.

2016-08-24 Thread carlmartin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

carlmartin updated SPARK-17233:
---
Description: 
When I execute some SQL statements periodically in a long-running 
thriftserver, the disk device fills up after about one week.
After checking the files on Linux, I found many shuffle files left in the 
block-mgr dir whose shuffle stages had finished long ago.
Finally I found that when shuffle files need to be cleaned, the driver notifies 
each executor to do the ShuffleClean. But when dynamic scheduling is enabled, 
the executor shuts itself down and cannot clean its shuffle files, so the files 
are left behind.

I tested this in Spark 1.5, but the master branch must have this issue as well.



  was:
When I execute some SQL statements periodically in a long-running 
thriftserver, the disk device fills up after about one week.
After checking the files on Linux, I found many shuffle files left in the 
block-mgr dir whose shuffle stages had finished long ago.
Finally I found that when shuffle files need to be cleaned, the driver notifies 
each executor to do the ShuffleClean. But when dynamic scheduling is enabled, 
the executor shuts itself down and cannot clean its shuffle files, so the files 
are left behind.




> Shuffle file will be left over the capacity when dynamic schedule is enabled 
> in a long running case.
> 
>
> Key: SPARK-17233
> URL: https://issues.apache.org/jira/browse/SPARK-17233
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.2, 2.0.0
>Reporter: carlmartin
>
> When I execute some SQL statements periodically in a long-running 
> thriftserver, the disk device fills up after about one week.
> After checking the files on Linux, I found many shuffle files left in the 
> block-mgr dir whose shuffle stages had finished long ago.
> Finally I found that when shuffle files need to be cleaned, the driver 
> notifies each executor to do the ShuffleClean. But when dynamic scheduling is 
> enabled, the executor shuts itself down and cannot clean its shuffle files, 
> so the files are left behind.
> I tested this in Spark 1.5, but the master branch must have this issue as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17204:
---
Description: 
We use the OFF_HEAP storage level extensively with great success. We've tried 
off-heap storage with replication factor 2 and have always received exceptions 
on the executor side very shortly after starting the job. For example:

{code}
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

or

{code}
java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436213#comment-15436213
 ] 

Saisai Shao commented on SPARK-17204:
-

I think that to reflect the issue, {{sc.range(0, 0)}} should be changed to 
{{sc.range(0, 2)}}; {{range(0, 0)}} actually persists nothing to memory.
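
For reference, a minimal sketch (spark-shell, Scala) of the adjusted 
reproduction, reusing the {{OFF_HEAP_2}} storage level from the original 
snippet; the added import and the exact job are assumptions:

{code}
import org.apache.spark.storage.StorageLevel

// Same storage level as the original reproduction, but persist two elements
// so that replication to a peer is actually exercised.
val OFF_HEAP_2 = StorageLevel(useDisk = true, useMemory = true,
  useOffHeap = true, deserialized = false, replication = 2)
sc.range(0, 2).persist(OFF_HEAP_2).count
{code}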

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively with great success. We've tried 
> off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> 

[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17204:
---
Description: 
We use the OFF_HEAP storage level extensively with great success. We've tried 
off-heap storage with replication factor 2 and have always received exceptions 
on the executor side very shortly after starting the job. For example:

{code}
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

or

{code}
java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at 
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Created] (SPARK-17233) Shuffle file will be left over the capacity when dynamic schedule is enabled in a long running case.

2016-08-24 Thread carlmartin (JIRA)
carlmartin created SPARK-17233:
--

 Summary: Shuffle file will be left over the capacity when dynamic 
schedule is enabled in a long running case.
 Key: SPARK-17233
 URL: https://issues.apache.org/jira/browse/SPARK-17233
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.2, 1.5.2
Reporter: carlmartin


When I execute some SQL statements periodically in a long-running 
thriftserver, the disk device fills up after about one week.
After checking the files on Linux, I found many shuffle files left in the 
block-mgr dir whose shuffle stages had finished long ago.
Finally I found that when shuffle files need to be cleaned, the driver notifies 
each executor to do the ShuffleClean. But when dynamic scheduling is enabled, 
the executor shuts itself down and cannot clean its shuffle files, so the files 
are left behind.
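
For illustration, a minimal sketch (Scala) of the kind of long-running workload 
described above with dynamic allocation enabled; the application name, table, 
and interval are assumptions, not taken from this report:

{code}
// Hypothetical setup; only the dynamic-allocation setting is from the report.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("periodic-sql")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // required for dynamic allocation
  .getOrCreate()

while (true) {
  // Each run shuffles data. Once the stage finishes and idle executors are
  // removed, the driver can no longer ask those executors to clean their
  // shuffle files, so the files accumulate on disk.
  spark.sql("SELECT key, count(*) FROM events GROUP BY key").count()
  Thread.sleep(60L * 60L * 1000L) // roughly once per hour
}
{code}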





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436205#comment-15436205
 ] 

Saisai Shao commented on SPARK-17204:
-

No, I tested on a YARN cluster, not in local mode.

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
> with replication factor 2 and have always received exceptions on the executor 
> side very shortly after starting the job. For example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 

[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:22 AM:
--

Exposing a {{family}} or similar parameter sounds good to me.
One question:
{quote}
When the family is set to "binomial" we produce normal logistic regression with 
pivoting and when it is set to "multinomial" (default) it produces logistic 
regression with pivoting. 
{quote}
Should it be {{when it is set to "multinomial" (default) it produces logistic 
regression {color:red}without{color} pivoting}} ? Thanks!



was (Author: yanboliang):
Exposing a {{family}} or similar parameter sounds good to me.

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:22 AM:
--

Exposing a {{family}} or similar parameter sounds good to me.
One more question:
{quote}
When the family is set to "binomial" we produce normal logistic regression with 
pivoting and when it is set to "multinomial" (default) it produces logistic 
regression with pivoting. 
{quote}
Should it be {{when it is set to "multinomial" (default) it produces logistic 
regression {color:red}without{color} pivoting}} ? Thanks!



was (Author: yanboliang):
Exposing a {{family}} or similar parameter sounds good to me.
One question:
{quote}
When the family is set to "binomial" we produce normal logistic regression with 
pivoting and when it is set to "multinomial" (default) it produces logistic 
regression with pivoting. 
{quote}
Should it be {{when it is set to "multinomial" (default) it produces logistic 
regression {color:red}without{color} pivoting}} ? Thanks!


> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436202#comment-15436202
 ] 

Michael Allman commented on SPARK-17204:


Hi [~jerryshao]. I wonder if you're testing in local mode? I only see this 
problem when running with remote executors on a cluster. When I run in local 
mode, I see a bunch of warnings like:

{code}
...
16/08/25 03:13:55 WARN storage.BlockManager: Block rdd_3_8 replicated to only 0 
peer(s) instead of 1 peers
16/08/25 03:13:55 WARN storage.BlockManager: Block rdd_3_15 replicated to only 
0 peer(s) instead of 1 peers
16/08/25 03:13:55 WARN storage.BlockManager: Block rdd_3_9 replicated to only 0 
peer(s) instead of 1 peers
...
{code}

These messages suggest to me that no actual replication is being attempted, 
and that's why the problem does not manifest. To answer your other question, 
the test case I provided was something very simple I came up with after 
discovering this problem. My coworker was reading data from Parquet files when 
I cut and pasted those stack traces.

I'll clarify these points in the description.

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
> with replication factor 2 and have always received exceptions on the executor 
> side very shortly after starting the job. For example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> 

[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:14 AM:
--

Exposing a {{family}} or similar parameter sounds good to me.


was (Author: yanboliang):
Exposing a {{family}} or similar parameter to control pivoting sounds good to 
me.

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412
 ] 

Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:12 AM:
--

I think it's hard to unify binary and multinomial logistic regression without 
making any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This would be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. But it 
would introduce a breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, that will 
also introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

Here we have two choices: consolidate them, which will introduce breaking 
changes, or keep them separate.
-I would prefer to keep LOR and MLOR as different APIs, but I don't hold this 
opinion strongly if you have a better proposal. Thanks!-
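
To make the representation question concrete, a small illustrative sketch 
(Scala, {{org.apache.spark.ml.linalg}}); the shapes and values are made up and 
not taken from either model:

{code}
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Binary LOR today: a single coefficient vector plus a scalar intercept.
val blorCoefficients = Vectors.dense(0.5, -1.2, 0.3)
val blorIntercept = 0.1

// MLOR with k classes (here k = 2, 3 features): a k x numFeatures coefficient
// matrix plus a k-element intercept vector, the candidate unified form.
// Matrices.dense is column-major, so this encodes rows (0.5, -1.2, 0.3)
// and (-0.5, 1.2, -0.3).
val mlorCoefficients = Matrices.dense(2, 3, Array(0.5, -0.5, -1.2, 1.2, 0.3, -0.3))
val mlorIntercepts = Vectors.dense(0.1, -0.1)
{code}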


was (Author: yanboliang):
I think it's hard to unify binary and multinomial logistic regression without 
making any breaking change.
* Like [~sethah] said, we need to find a way to unify the representation of 
{{coefficients}} and {{intercept}}. I think flattening the matrix into a vector 
is still a compromise; the best representation would be a matrix for 
{{coefficients}} and a vector for {{intercept}}, even for a binary 
classification problem. This would be consistent with other ML models such as 
{{NaiveBayesModel}}, which also supports multi-class classification. But it 
would introduce a big breaking change.
* MLOR and LOR return different results for binary classification when 
regularization is used.
* The current LOR code base provides both {{setThreshold}} and 
{{setThresholds}} for binary logistic regression, and they have some 
interactions. If we make MLOR and LOR share the old LOR code base, that will 
also introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543.
* Model store/load compatibility.

Here we have two choices: consolidate them, which will introduce breaking 
changes, or keep them separate.
-I would prefer to keep LOR and MLOR as different APIs, but I don't hold this 
opinion strongly if you have a better proposal. Thanks!-

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191
 ] 

Yanbo Liang commented on SPARK-17163:
-

Exposing a {{family}} or similar parameter to control pivoting sounds good to 
me.
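
As a sketch of what that could look like (Scala); the parameter name 
{{family}} and its values are assumptions based on this discussion, not an 
existing API at the time of writing:

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical: a single estimator whose family controls the formulation.
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.01)
  .setFamily("binomial")      // pivoted binary LOR, single coefficient vector
// .setFamily("multinomial")  // MLOR without pivoting, coefficient matrix
{code}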

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and, is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2016-08-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436179#comment-15436179
 ] 

Saisai Shao commented on SPARK-17204:
-

It works OK in my local test with the latest build:

{code}
val OFF_HEAP_2 = StorageLevel(useDisk = true, useMemory = true, useOffHeap = 
true, deserialized = false, replication = 2)
sc.range(0, 0).persist(OFF_HEAP_2).count
{code}

I'm also curious why Spark SQL related code is involved, according to the 
exceptions you pasted above; are you using {{SparkSession#range}} instead? I 
also tested Dataset persistence with {{OFF_HEAP_2}}, and it works fine without 
exceptions. 



> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>
> We use the OFF_HEAP storage level extensively. We've tried off-heap storage 
> with replication factor 2 and have always received exceptions on the executor 
> side very shortly after starting the job. For example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> 

[jira] [Commented] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled

2016-08-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436172#comment-15436172
 ] 

Apache Spark commented on SPARK-15382:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14800

> monotonicallyIncreasingId doesn't work when data is upsampled
> -
>
> Key: SPARK-15382
> URL: https://issues.apache.org/jira/browse/SPARK-15382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Mateusz Buśkiewicz
> Fix For: 2.0.1, 2.1.0
>
>
> Assigned ids are not unique
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import monotonicallyIncreasingId
> hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 
> 10.0).withColumn('id', monotonicallyIncreasingId()).collect()
> {code}
> Output:
> {code}
> [Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled

2016-08-24 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436171#comment-15436171
 ] 

Takeshi Yamamuro commented on SPARK-15382:
--

Sorry, but the master branch still has this bug.
I made a PR, so could you check it?

> monotonicallyIncreasingId doesn't work when data is upsampled
> -
>
> Key: SPARK-15382
> URL: https://issues.apache.org/jira/browse/SPARK-15382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Mateusz Buśkiewicz
> Fix For: 2.0.1, 2.1.0
>
>
> Assigned ids are not unique
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import monotonicallyIncreasingId
> hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 
> 10.0).withColumn('id', monotonicallyIncreasingId()).collect()
> {code}
> Output:
> {code}
> [Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17226) Allow defining multiple date formats per column in csv

2016-08-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436158#comment-15436158
 ] 

Hyukjin Kwon commented on SPARK-17226:
--

Could you add code to reproduce the problem and a concrete suggestion, rather than 
just pointing at the code? I think it is arguable how to deal with this. There is 
actually already an issue about this in the external CSV library: 
https://github.com/databricks/spark-csv/pull/359
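As a user-level fallback in the meantime, such columns can be read as strings and 
parsed with several candidate formats in a UDF; a rough sketch (patterns, path and 
column name are made up for illustration):

{code}
import java.sql.Date
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{col, udf}

// Try each pattern in order and return the first successful parse (or null).
val patterns = Seq("yyyy-MM-dd", "MM/dd/yyyy", "dd.MM.yyyy")

val parseDate = udf { (s: String) =>
  if (s == null) null
  else patterns.view.flatMap { p =>
    try Some(new Date(new SimpleDateFormat(p).parse(s).getTime))
    catch { case _: Exception => None }
  }.headOption.orNull
}

// Hypothetical input path and column name, purely for illustration.
val df = spark.read.option("header", "true").csv("/path/to/messy.csv")
df.withColumn("event_date", parseDate(col("event_date"))).show()
{code}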

> Allow defining multiple date formats per column in csv
> --
>
> Key: SPARK-17226
> URL: https://issues.apache.org/jira/browse/SPARK-17226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Useful to have fallbacks in case of messy input and different columns can 
> have different formats.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L106



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17222) Support multline csv records

2016-08-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436154#comment-15436154
 ] 

Hyukjin Kwon commented on SPARK-17222:
--

Here is a *related* PR, https://github.com/apache/spark/pull/13007, and a *related* 
issue: SPARK-15226.
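For reference, a minimal way to observe the current behaviour described below (a 
sketch; the temp path is arbitrary and assumes a local master):

{code}
import java.nio.file.{Files, Paths}

// A single logical CSV record whose second field contains a newline.
val record = "\"aaa\",\"bb\nb\",\"ccc\"\n"
Files.write(Paths.get("/tmp/multiline.csv"), record.getBytes("UTF-8"))

// Currently yields two rows, since each physical line is treated as a record.
spark.read.csv("/tmp/multiline.csv").show()
{code}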

> Support multline csv records
> 
>
> Key: SPARK-17222
> URL: https://issues.apache.org/jira/browse/SPARK-17222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>
> Below should be read as one record and currently it won't be since files and 
> records are split on new line.
> {code}
> "aaa","bb
> b","ccc"
> {code}
> This shouldn't be default behaviour due to performance but should be 
> configurable if necessary



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv

2016-08-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436150#comment-15436150
 ] 

Hyukjin Kwon commented on SPARK-17227:
--

Ah, SPARK-17222 is about multiple lines, but IMHO it might have been nicer to 
summarize these in a single JIRA, because I guess a single PR would fix all the 
listed JIRAs.

> Allow configuring record delimiter in csv
> -
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Instead of hard coded "\n"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv

2016-08-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436147#comment-15436147
 ] 

Hyukjin Kwon commented on SPARK-17227:
--

We may have to open a JIRA to deal with multi-line records first. The root cause is 
the use of {{LineRecordReader}}; for the same reason, the JSON data source also 
requires each document to be on a single line. I got rid of the weird inconsistent 
behavior in {{CSVParser}} in the current master anyway. So, if my understanding is 
correct, Spark's CSV data source does not support multi-line records at all.
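As a possible user-level workaround (not something the CSV data source does today), 
a custom record delimiter can be set on Hadoop's {{TextInputFormat}} and the 
resulting records handed to a CSV parser; a rough sketch, assuming a 
single-character delimiter {{|}} and an illustrative path:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
// Hadoop's LineRecordReader honours this property for custom record delimiters.
conf.set("textinputformat.record.delimiter", "|")

val records = sc
  .newAPIHadoopFile("/path/to/data.csv", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }

// Each element of `records` is now one record, regardless of embedded '\n'.
records.take(5).foreach(println)
{code}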

> Allow configuring record delimiter in csv
> -
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Instead of hard coded "\n"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-24 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436143#comment-15436143
 ] 

Vincent commented on SPARK-17219:
-

For this scenario, we could add a new parameter to QuantileDiscretizer, a 
nullStrategy param as Barry mentioned. R actually supports this kind of option 
with an {{na.rm}} flag, which lets the user either remove NaN elements before 
computing quantiles or throw an error (the default). So I think it would be a 
nice thing to have in Spark too.
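To illustrate what a remove-style option would do, here is the manual equivalent 
today: drop the NaN rows before fitting, then bucketize. A sketch, assuming {{df}} 
is the input DataFrame and the column names are illustrative:

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.functions.{col, isnan}

// Nulls were already replaced with Double.NaN upstream, as in the report.
val nonNan = df.filter(!isnan(col("age")))

val discretizer = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(10)

// Fitting on NaN-free data keeps NaN out of the computed splits.
val bucketed = discretizer.fit(nonNan).transform(nonNan)
{code}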

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv

2016-08-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436140#comment-15436140
 ] 

Hyukjin Kwon commented on SPARK-17227:
--

Also, it would be great if the JIRA included an example and a description of the 
problem, so that this can be tested and reproduced.

> Allow configuring record delimiter in csv
> -
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Instead of hard coded "\n"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv

2016-08-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436139#comment-15436139
 ] 

Hyukjin Kwon commented on SPARK-17227:
--

If I remember correctly, we are not using that {{rowSeparator}} anymore 
(although it is still set on the parser), and I think it should be removed, since 
the CSV data source reads each line with {{LineRecordReader}}.

> Allow configuring record delimiter in csv
> -
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Instead of hard coded "\n"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-24 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436136#comment-15436136
 ] 

Vincent commented on SPARK-17219:
-

For cases where only null and non-null buckets are needed, I guess we don't need 
to call QuantileDiscretizer at all.
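Right; for that degenerate case a single expression is enough. A sketch, assuming 
{{df}} is the input DataFrame with nulls already encoded as NaN:

{code}
import org.apache.spark.sql.functions.{col, isnan, lit, when}

// 0 = missing (NaN stand-in for null), 1 = present; no discretizer required.
val bucketed = df.withColumn("ageBucket",
  when(isnan(col("age")), lit(0)).otherwise(lit(1)))
{code}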

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16216) CSV data source does not write date and timestamp correctly

2016-08-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436124#comment-15436124
 ] 

Apache Spark commented on SPARK-16216:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14799
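Until that change lands, one workaround on the write side is to render 
date/timestamp columns as strings explicitly before writing; a small sketch 
(assuming {{df}} is the DataFrame being written, {{date}} is a {{DateType}} column, 
and the pattern and output path are illustrative):

{code}
import org.apache.spark.sql.functions.{col, date_format}

// Render the date column as a formatted string, then write it out as CSV.
df.withColumn("date", date_format(col("date"), "yyyy-MM-dd"))
  .write
  .csv("/path/to/out")
{code}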

> CSV data source does not write date and timestamp correctly
> ---
>
> Key: SPARK-16216
> URL: https://issues.apache.org/jira/browse/SPARK-16216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: releasenotes
>
> Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below:
> {code}
> ++
> |date|
> ++
> |14406372|
> |14144598|
> |14540400|
> ++
> {code}
> It would be nicer if it wrote dates and timestamps as formatted strings, just 
> like the JSON data source does.
> Also, the CSV data source currently supports a {{dateFormat}} option for reading 
> dates and timestamps in a custom format. It might be better if this option could 
> be applied when writing as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17110) Pyspark with locality ANY throw java.io.StreamCorruptedException

2016-08-24 Thread Gen TANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436120#comment-15436120
 ] 

Gen TANG commented on SPARK-17110:
--

[~radost...@gmail.com], it seems the Spark Scala API doesn't have this bug in 
version 2.0.0.
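For reference, a rough Scala analogue of the reproduction in the issue description 
below (what one might run from {{spark-shell}} to check the Scala side, assuming a 
comparable 2-node setup):

{code}
val x = sc.parallelize(Seq(1, 1, 1, 1, 1, 1000, 1, 1, 1), numSlices = 9).cache()
x.count()

// Sleeping makes some tasks run long enough that others are scheduled with
// locality ANY and have to fetch the cached partitions remotely.
def waitMap(n: Int): Int = { Thread.sleep(n * 1000L); n }
x.map(waitMap).count()
{code}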

> Pyspark with locality ANY throw java.io.StreamCorruptedException
> 
>
> Key: SPARK-17110
> URL: https://issues.apache.org/jira/browse/SPARK-17110
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: Cluster of 2 AWS r3.xlarge nodes launched via ec2 
> scripts, Spark 2.0.0, hadoop: yarn, pyspark shell
>Reporter: Tomer Kaftan
>Priority: Critical
>
> In Pyspark 2.0.0, any task that accesses cached data non-locally throws a 
> StreamCorruptedException like the stacktrace below:
> {noformat}
> WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): 
> java.io.StreamCorruptedException: invalid stream header: 12010A80
> at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807)
> at java.io.ObjectInputStream.(ObjectInputStream.java:302)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:63)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
> at 
> org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The simplest way I have found to reproduce this is by running the following 
> code in the pyspark shell, on a cluster of 2 nodes set to use only one worker 
> core each:
> {code}
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
> time.sleep(x)
> return x
> x.map(waitMap).count()
> {code}
> Or by running the following via spark-submit:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache()
> x.count()
> import time
> def waitMap(x):
> time.sleep(x)
> return x
> x.map(waitMap).count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-24 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436033#comment-15436033
 ] 

Seth Hendrickson commented on SPARK-17163:
--

BTW, I am happy to take care of merging the interfaces if we decide to. I did a 
bit of exploration today and it seems like it will be rather straightforward.

> Decide on unified multinomial and binary logistic regression interfaces
> ---
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression. 
> After SPARK-7159, we have both LogisticRegression and 
> MultinomialLogisticRegression models. This may be confusing to users and is 
> a bit superfluous since MLOR can do basically all of what BLOR does. We 
> should decide if it needs to be changed and implement those changes before 2.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17156) Add multiclass logistic regression Scala Example

2016-08-24 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436004#comment-15436004
 ] 

Miao Wang commented on SPARK-17156:
---

Two quick comments:
1) Add some comments, as in the LogisticRegressionExample.
2) For the dataset, could you use a dataset similar to the one used in the tests? 
(See the sketch below.)
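As a rough sketch of what the example could look like (assuming the interfaces end 
up merged as discussed in SPARK-17163, so that plain {{LogisticRegression}} accepts 
a multinomial family; the dataset is the small 3-class file shipped under 
{{data/mllib}}, and the hyperparameters are arbitrary):

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Small 3-class dataset shipped with Spark, similar to what the tests use.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

val lr = new LogisticRegression()
  .setFamily("multinomial")   // assumes the unified interface from SPARK-17163
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val model = lr.fit(training)
println(s"Coefficients:\n${model.coefficientMatrix}")
println(s"Intercepts: ${model.interceptVector}")
{code}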

> Add multiclass logistic regression Scala Example
> 
>
> Key: SPARK-17156
> URL: https://issues.apache.org/jira/browse/SPARK-17156
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Reporter: Miao Wang
>
> As [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been 
> merged to master, we should add a Scala example of using multiclass logistic 
> regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-24 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435996#comment-15435996
 ] 

Miao Wang commented on SPARK-17157:
---

Start working on it now. 

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Miao Wang
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been 
> merged to master. I opened this JIRA to discuss adding a SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17231:


Assignee: (was: Apache Spark)

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements 2.jpg, 
> logging_perf_improvements.jpg, master 2.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17231:


Assignee: Apache Spark

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Assignee: Apache Spark
>Priority: Minor
> Attachments: logging_perf_improvements 2.jpg, 
> logging_perf_improvements.jpg, master 2.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435958#comment-15435958
 ] 

Apache Spark commented on SPARK-17231:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/14798

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements 2.jpg, 
> logging_perf_improvements.jpg, master 2.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17231:
---
Description: While debugging the performance of a large GraphX connected 
components computation, I found several places in the {{network-common}} and 
{{network-shuffle}} code bases where trace or debug log messages are 
constructed even if the respective log level is disabled. Refactoring the 
respective code to avoid these constructions except where necessary led to a 
modest but measurable reduction in task time, GC time and the ratio thereof.  
(was: While debugging the performance of a large GraphX connected components 
computation, I found several places in the {{network-common}} and 
{{network-shuffle}} code bases where trace or debug log messages are 
constructed even if the respective log level is disabled. Refactoring the 
respective code to avoid these constructions except where necessary led to a 
modest but measurable reduction in task time, GC time and the ratio thereof.

(PR to come.))

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements 2.jpg, 
> logging_perf_improvements.jpg, master 2.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435767#comment-15435767
 ] 

Michael Allman edited comment on SPARK-17231 at 8/24/16 11:29 PM:
--

I've attached screenshots from four independent test runs on four different EC2 
clusters with the same configuration. The only differences are that the master 
files are from test runs without the logging patches and the 
logging_perf_improvements files are from test runs with the logging patches.


was (Author: michael):
Note that in the attached screenshots, all stats are the same except task and 
gc time.

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements 2.jpg, 
> logging_perf_improvements.jpg, master 2.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.
> (PR to come.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17231:
---
Attachment: master 2.jpg
logging_perf_improvements 2.jpg

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements 2.jpg, 
> logging_perf_improvements.jpg, master 2.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.
> (PR to come.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name

2016-08-24 Thread Louis Salin (JIRA)
Louis Salin created SPARK-17232:
---

 Summary: Expecting same behavior after loading a dataframe with 
dots in column name
 Key: SPARK-17232
 URL: https://issues.apache.org/jira/browse/SPARK-17232
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Louis Salin


In Spark 2.0, the behavior of a dataframe changes after saving and reloading it 
when there are dots in the column names. In the example below, I was able to 
call the {{rdd}} function for a newly created dataframe. However, after saving 
it and reloading it, an exception gets thrown when calling the {{rdd}} function.

from a spark-shell:

{{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}}
Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]

{{scala> simpleDf.rdd}}
Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = 
MapPartitionsRDD\[7\] at rdd at :29

{{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}}

{{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}}
Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]

{{scala> readDf.rdd}}

{noformat}
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, a.c];
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
  at 
org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
  at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
  at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
  at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
  at 
org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
  at 
org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
  at 
org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
  at 
org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at 
org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
  at 
org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
 

[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17231:
---
Description: 
While debugging the performance of a large GraphX connected components 
computation, I found several places in the {{network-common}} and 
{{network-shuffle}} code bases where trace or debug log messages are 
constructed even if the respective log level is disabled. Refactoring the 
respective code to avoid these constructions except where necessary led to a 
modest but measurable reduction in task time, GC time and the ratio thereof.

(PR to come.)

  was:While debugging the performance of a large GraphX connected components 
computation, I found several places in the {{network-common}} and 
{{network-shuffle}} code bases where trace or debug log messages are 
constructed even if the respective log level is disabled. Refactoring the 
respective code to avoid these constructions except where necessary led to a 
modest but measurable reduction in task time, GC time and the ratio thereof.


> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.
> (PR to come.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17231:
---
Description: While debugging the performance of a large GraphX connected 
components computation, I found several places in the {{network-common}} and 
{{network-shuffle}} code bases where trace or debug log messages are 
constructed even if the respective log level is disabled. Refactoring the 
respective code to avoid these constructions except where necessary led to a 
modest but measurable reduction in task time, GC time and the ratio thereof.  
(was: While debugging the performance of a large GraphX connected components 
computation, I found several places in the `network-common` and 
`network-shuffle` code bases where trace or debug log messages are constructed 
even if the respective log level is disabled. Refactoring the respective code 
to avoid these constructions except where necessary led to a modest but 
measurable reduction in task time, GC time and the ratio thereof.)

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the {{network-common}} and 
> {{network-shuffle}} code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile

2016-08-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435773#comment-15435773
 ] 

Dongjoon Hyun commented on SPARK-17123:
---

Hi, [~joshrosen].
I'll make a PR for this.
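In the meantime, based on the reproduction below, one workaround is to make the 
cast explicit so both sides of the union share the same type and no string-to-date 
conversion is generated; a sketch (assuming the default column name {{value}} from 
the reproduction):

{code}
import org.apache.spark.sql.functions.col

// Cast the string column to DateType up front so the union has one schema.
dateDF.union(longDF.select(col("value").cast("date"))).collect()
{code}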

> Performing set operations that combine string and date / timestamp columns 
> may result in generated projection code which doesn't compile
> 
>
> Key: SPARK-17123
> URL: https://issues.apache.org/jira/browse/SPARK-17123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> The following example program causes SpecificSafeProjection code generation 
> to produce Java code which doesn't compile:
> {code}
> import org.apache.spark.sql.types._
> spark.sql("set spark.sql.codegen.fallback=false")
> val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new 
> java.sql.Date(0)))), StructType(StructField("value", DateType) :: Nil))
> val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF
> dateDF.union(longDF).collect()
> {code}
> This fails at runtime with the following error:
> {code}
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 28, Column 107: No applicable constructor/method found 
> for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates 
> are: "public static java.sql.Date 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificSafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private org.apache.spark.sql.types.StructType schema;
> /* 011 */
> /* 012 */
> /* 013 */   public SpecificSafeProjection(Object[] references) {
> /* 014 */ this.references = references;
> /* 015 */ mutableRow = (MutableRow) references[references.length - 1];
> /* 016 */
> /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) 
> references[0];
> /* 018 */   }
> /* 019 */
> /* 020 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 021 */ InternalRow i = (InternalRow) _i;
> /* 022 */
> /* 023 */ values = new Object[1];
> /* 024 */
> /* 025 */ boolean isNull2 = i.isNullAt(0);
> /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
> /* 027 */ boolean isNull1 = isNull2;
> /* 028 */ final java.sql.Date value1 = isNull1 ? null : 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2);
> /* 029 */ isNull1 = value1 == null;
> /* 030 */ if (isNull1) {
> /* 031 */   values[0] = null;
> /* 032 */ } else {
> /* 033 */   values[0] = value1;
> /* 034 */ }
> /* 035 */
> /* 036 */ final org.apache.spark.sql.Row value = new 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, 
> schema);
> /* 037 */ if (false) {
> /* 038 */   mutableRow.setNullAt(0);
> /* 039 */ } else {
> /* 040 */
> /* 041 */   mutableRow.update(0, value);
> /* 042 */ }
> /* 043 */
> /* 044 */ return mutableRow;
> /* 045 */   }
> /* 046 */ }
> {code}
> Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the 
> generated code tries to call it with a UTF8String while the method expects an 
> int instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17231:
---
Description: While debugging the performance of a large GraphX connected 
components computation, I found several places in the `network-common` and 
`network-shuffle` code bases where trace or debug log messages are constructed 
even if the respective log level is disabled. Refactoring the respective code 
to avoid these constructions except where necessary led to a modest but 
measurable reduction in task time, GC time and the ratio thereof.  (was: While 
debugging the performance of a large GraphX connected components computation, I 
found several places in the `network-common` and `network-shuffle` code bases 
where trace or debug log messages are constructed even if the respective log 
level is disabled. Refactoring the respective code to avoid these constructions 
except where necessary led to a modest but measurable reduction in task time, 
GC time and the ratio thereof.

(PR to follow.))

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the `network-common` and 
> `network-shuffle` code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17231:
---
Description: 
While debugging the performance of a large GraphX connected components 
computation, I found several places in the `network-common` and 
`network-shuffle` code bases where trace or debug log messages are constructed 
even if the respective log level is disabled. Refactoring the respective code 
to avoid these constructions except where necessary led to a modest but 
measurable reduction in task time, GC time and the ratio thereof.

(PR to follow.)

  was:
While debugging the performance of a large GraphX connected components 
computation, I found several places in the `network-common` and 
`network-shuffle` code bases where trace or debug log messages are constructed 
even if the respective log level is disabled. Refactoring the respective code 
to avoid these constructions except where necessary led to a modest but 
measurable reduction in task time, GC time and the ratio thereof.

(Before and after executor stats to follow in screenshots.)

(PR to follow.)


> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the `network-common` and 
> `network-shuffle` code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.
> (PR to follow.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435767#comment-15435767
 ] 

Michael Allman commented on SPARK-17231:


Note that in the attached screenshots, all stats are the same except task and 
gc time.

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the `network-common` and 
> `network-shuffle` code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.
> (Before and after executor stats to follow in screenshots.)
> (PR to follow.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17231:
---
Attachment: logging_perf_improvements.jpg
master.jpg

> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
> Attachments: logging_perf_improvements.jpg, master.jpg
>
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the `network-common` and 
> `network-shuffle` code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.
> (Before and after executor stats to follow in screenshots.)
> (PR to follow.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-17231:
---
Description: 
While debugging the performance of a large GraphX connected components 
computation, I found several places in the `network-common` and 
`network-shuffle` code bases where trace or debug log messages are constructed 
even if the respective log level is disabled. Refactoring the respective code 
to avoid these constructions except where necessary led to a modest but 
measurable reduction in task time, GC time and the ratio thereof.

(Before and after executor stats to follow in screenshots.)

(PR to follow.)

  was:
While debugging the performance of a large GraphX connected components 
computation, I found several places in the `network-common` and 
`network-shuffle` code bases where trace or debug log messages are constructed 
even if the respective log level is disabled. Refactoring the respective code 
to avoid these constructions except where necessary led to a modest but 
measurable performance improvement.

(Before and after executor stats to follow in screenshots.)

(PR to follow.)


> Avoid building debug or trace log messages unless the respective log level is 
> enabled
> -
>
> Key: SPARK-17231
> URL: https://issues.apache.org/jira/browse/SPARK-17231
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: Spark cluster with 8 r3.8xl EC2 worker instances
>Reporter: Michael Allman
>Priority: Minor
>
> While debugging the performance of a large GraphX connected components 
> computation, I found several places in the `network-common` and 
> `network-shuffle` code bases where trace or debug log messages are 
> constructed even if the respective log level is disabled. Refactoring the 
> respective code to avoid these constructions except where necessary led to a 
> modest but measurable reduction in task time, GC time and the ratio thereof.
> (Before and after executor stats to follow in screenshots.)
> (PR to follow.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled

2016-08-24 Thread Michael Allman (JIRA)
Michael Allman created SPARK-17231:
--

 Summary: Avoid building debug or trace log messages unless the 
respective log level is enabled
 Key: SPARK-17231
 URL: https://issues.apache.org/jira/browse/SPARK-17231
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
 Environment: Spark cluster with 8 r3.8xl EC2 worker instances
Reporter: Michael Allman
Priority: Minor


While debugging the performance of a large GraphX connected components 
computation, I found several places in the `network-common` and 
`network-shuffle` code bases where trace or debug log messages are constructed 
even if the respective log level is disabled. Refactoring the respective code 
to avoid these constructions except where necessary led to a modest but 
measurable performance improvement.

(Before and after executor stats to follow in screenshots.)

(PR to follow.)
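For illustration, the change boils down to guarding message construction behind a 
level check; a minimal slf4j-style sketch in Scala (logger name, message and 
variables are made up):

{code}
import org.slf4j.LoggerFactory

val logger = LoggerFactory.getLogger("example")

val chunkIndex = 42                 // illustrative values
val remoteAddress = "host:7337"

// Eager: the string (and any expensive toString calls) is built even when
// TRACE is disabled.
logger.trace("Sending chunk " + chunkIndex + " to " + remoteAddress)

// Guarded: the message is only built when the level is actually enabled.
if (logger.isTraceEnabled) {
  logger.trace("Sending chunk " + chunkIndex + " to " + remoteAddress)
}
{code}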



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17211) Broadcast join produces incorrect results

2016-08-24 Thread Himanish Kushary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435745#comment-15435745
 ] 

Himanish Kushary commented on SPARK-17211:
--

I ran the following in a Databricks environment with Spark 2.0. Works fine.

{code:java}
import spark.implicits._

val a1 = Array((123,1),(234,2),(432,5))
val a2 = Array(("abc",1),("bcd",2),("dcb",5))
val df1 = sc.parallelize(a1).toDF("gid","id")
val df2 = sc.parallelize(a2).toDF("gname","id")
df1.join(df2,"id").show() // WORKS
+---+---+-+
| id|gid|gname|
+---+---+-+
|  5|432|  dcb|
|  2|234|  bcd|
|  1|123|  abc|
+---+---+-+
df1.join(broadcast(df2),"id").show() // BROADCASTING - DOES NOT WORK on EMR
+---+---+-+
| id|gid|gname|
+---+---+-+
|  1|123| null|
|  2|234| null|
|  5|432| null|
+---+---+-+
broadcast(df1).join(df2,"id").show() // BROADCASTING - DOES NOT WORK on EMR
{code}

> Broadcast join produces incorrect results
> -
>
> Key: SPARK-17211
> URL: https://issues.apache.org/jira/browse/SPARK-17211
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Jarno Seppanen
>
> Broadcast join produces incorrect columns in join result, see below for an 
> example. The same join but without using broadcast gives the correct columns.
> Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
> keys = [
> (5400, 0),
> (5401, 1),
> (5402, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # ++-+
> # |  key_id|value|
> # ++-+
> # |5400|0|
> # |5401|1|
> # |5402|2|
> # ++-+
> data = [
> (5402,1),
> (5400,2),
> (5401,3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # ++---+  
> 
> # |  key_id|foo|
> # ++---+
> # |5402|  1|
> # |5400|  2|
> # |5401|  3|
> # ++---+
> ### INCORRECT ###
> data_df.join(func.broadcast(keys_df), 'key_id').show()
> # ++---++ 
> 
> # |  key_id|foo|   value|
> # ++---++
> # |5402|  1|5402|
> # |5400|  2|5400|
> # |5401|  3|5401|
> # ++---++
> ### CORRECT ###
> data_df.join(keys_df, 'key_id').show()
> # ++---+-+
> # |  key_id|foo|value|
> # ++---+-+
> # |5400|  2|0|
> # |5401|  3|1|
> # |5402|  1|2|
> # ++---+-+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17230) Writing decimal to csv will result empty string if the decimal exceeds (20, 18)

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17230:


Assignee: Davies Liu  (was: Apache Spark)

> Writing decimal to csv will result empty string if the decimal exceeds (20, 
> 18)
> ---
>
> Key: SPARK-17230
> URL: https://issues.apache.org/jira/browse/SPARK-17230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> // file content 
> spark.read.csv("/mnt/djiang/test-case.csv").show 
> // read in as string and create temp view 
> spark.read.csv("/mnt/djiang/test-case.csv").createOrReplaceTempView("test") 
> // confirm schema 
> spark.table("test").printSchema 
> // apply decimal calculation, confirm the result is correct 
> spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) 
> from test").show(false) 
> // run the same query, and write out as csv 
> spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) 
> from test").write.csv("/mnt/djiang/test-case-result") 
> // show the content of the result file, particularly, for number exceeded 
> decimal(20, 18), the csv is not writing anything or failing silently 
> spark.read.csv("/mnt/djiang/test-case-result").show
> +--+ 
> | _c0| 
> +--+ 
> | 1| 
> | 10| 
> | 100| 
> | 1000| 
> | 1| 
> |10| 
> +--+
> root 
> |-- _c0: string (nullable = true)
> +--+-+
>  
> |_c0 |(CAST(CAST(CAST(CAST(_c0 AS DECIMAL(20,0)) AS BIGINT) AS DECIMAL(20,0)) 
> AS DECIMAL(38,18)) * CAST(CAST(1.0 AS DECIMAL(38,18)) AS DECIMAL(38,18)))| 
> +--+-+
>  
> |1 |1.00 | 
> |10 |10.00 | 
> |100 |100.00 | 
> |1000 |1000.00 | 
> |1 |1.00 |
> |10|10.00 | 
> +--+-+
> +--++ 
> | _c0| _c1| 
> +--++ 
> | 1|1.00| 
> | 10|10.00...| 
> | 100| | 
> | 1000| | 
> | 1| | 
> |10| | 
> +--++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17230) Writing decimal to csv will result empty string if the decimal exceeds (20, 18)

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17230:


Assignee: Apache Spark  (was: Davies Liu)

> Writing decimal to csv will result empty string if the decimal exceeds (20, 
> 18)
> ---
>
> Key: SPARK-17230
> URL: https://issues.apache.org/jira/browse/SPARK-17230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> {code}
> // file content 
> spark.read.csv("/mnt/djiang/test-case.csv").show 
> // read in as string and create temp view 
> spark.read.csv("/mnt/djiang/test-case.csv").createOrReplaceTempView("test") 
> // confirm schema 
> spark.table("test").printSchema 
> // apply decimal calculation, confirm the result is correct 
> spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) 
> from test").show(false) 
> // run the same query, and write out as csv 
> spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) 
> from test").write.csv("/mnt/djiang/test-case-result") 
> // show the content of the result file, particularly, for number exceeded 
> decimal(20, 18), the csv is not writing anything or failing silently 
> spark.read.csv("/mnt/djiang/test-case-result").show
> +--+ 
> | _c0| 
> +--+ 
> | 1| 
> | 10| 
> | 100| 
> | 1000| 
> | 1| 
> |10| 
> +--+
> root 
> |-- _c0: string (nullable = true)
> +--+-+
>  
> |_c0 |(CAST(CAST(CAST(CAST(_c0 AS DECIMAL(20,0)) AS BIGINT) AS DECIMAL(20,0)) 
> AS DECIMAL(38,18)) * CAST(CAST(1.0 AS DECIMAL(38,18)) AS DECIMAL(38,18)))| 
> +--+-+
>  
> |1 |1.00 | 
> |10 |10.00 | 
> |100 |100.00 | 
> |1000 |1000.00 | 
> |1 |1.00 |
> |10|10.00 | 
> +--+-+
> +--++ 
> | _c0| _c1| 
> +--++ 
> | 1|1.00| 
> | 10|10.00...| 
> | 100| | 
> | 1000| | 
> | 1| | 
> |10| | 
> +--++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17230) Writing decimal to csv will result empty string if the decimal exceeds (20, 18)

2016-08-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435725#comment-15435725
 ] 

Apache Spark commented on SPARK-17230:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14797

> Writing decimal to csv will result empty string if the decimal exceeds (20, 
> 18)
> ---
>
> Key: SPARK-17230
> URL: https://issues.apache.org/jira/browse/SPARK-17230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> // file content 
> spark.read.csv("/mnt/djiang/test-case.csv").show 
> // read in as string and create temp view 
> spark.read.csv("/mnt/djiang/test-case.csv").createOrReplaceTempView("test") 
> // confirm schema 
> spark.table("test").printSchema 
> // apply decimal calculation, confirm the result is correct 
> spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) 
> from test").show(false) 
> // run the same query, and write out as csv 
> spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) 
> from test").write.csv("/mnt/djiang/test-case-result") 
> // show the content of the result file, particularly, for number exceeded 
> decimal(20, 18), the csv is not writing anything or failing silently 
> spark.read.csv("/mnt/djiang/test-case-result").show
> +--+ 
> | _c0| 
> +--+ 
> | 1| 
> | 10| 
> | 100| 
> | 1000| 
> | 1| 
> |10| 
> +--+
> root 
> |-- _c0: string (nullable = true)
> +--+-+
>  
> |_c0 |(CAST(CAST(CAST(CAST(_c0 AS DECIMAL(20,0)) AS BIGINT) AS DECIMAL(20,0)) 
> AS DECIMAL(38,18)) * CAST(CAST(1.0 AS DECIMAL(38,18)) AS DECIMAL(38,18)))| 
> +--+-+
>  
> |1 |1.00 | 
> |10 |10.00 | 
> |100 |100.00 | 
> |1000 |1000.00 | 
> |1 |1.00 |
> |10|10.00 | 
> +--+-+
> +--++ 
> | _c0| _c1| 
> +--++ 
> | 1|1.00| 
> | 10|10.00...| 
> | 100| | 
> | 1000| | 
> | 1| | 
> |10| | 
> +--++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17120:


Assignee: Apache Spark

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17120:


Assignee: (was: Apache Spark)

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17226) Allow defining multiple date formats per column in csv

2016-08-24 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435709#comment-15435709
 ] 

Robert Kruszewski commented on SPARK-17226:
---

Is there anything in particular you have in mind? I should have defined the 
components and versions beforehand; that's my mistake. It's still useful to have 
an issue for tracking in case anyone else comes looking for this.

> Allow defining multiple date formats per column in csv
> --
>
> Key: SPARK-17226
> URL: https://issues.apache.org/jira/browse/SPARK-17226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Useful to have fallbacks in case of messy input and different columns can 
> have different formats.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L106
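
For context, the kind of per-column fallback being requested could look roughly 
like the following in plain Scala (illustrative only; the patterns are made-up 
examples, and Spark's CSV reader currently accepts a single dateFormat option):

{code}
import java.text.SimpleDateFormat
import scala.util.Try

// Try each configured pattern in order and take the first one that parses.
val patterns = Seq("yyyy-MM-dd", "MM/dd/yyyy", "dd MMM yyyy")

def parseDate(s: String): Option[java.sql.Date] =
  patterns.view
    .flatMap(p => Try(new SimpleDateFormat(p).parse(s)).toOption)
    .headOption
    .map(d => new java.sql.Date(d.getTime))
{code}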



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17099:


Assignee: (was: Apache Spark)

> Incorrect result when HAVING clause is added to group by query
> --
>
> Key: SPARK-17099
> URL: https://issues.apache.org/jira/browse/SPARK-17099
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Random query generation uncovered the following query which returns incorrect 
> results when run on Spark SQL. This wasn't the original query uncovered by 
> the generator, since I performed a bit of minimization to try to make it more 
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
> (-769, -244),
> (-800, -409),
> (940, 86),
> (-507, 304),
> (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>  COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, 
> t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four 
> rows. However, if I omit the {{HAVING}} clause I see that the group's rows 
> are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +--+---+--+
> | sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2) 
>  |
> +--+---+--+
> | -507 | -1014
>  |
> | 940  | 1880 
>  |
> | -769 | -1538
>  |
> | -367 | -734 
>  |
> | -800 | -1600
>  |
> +--+---+--+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four 
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm 
> opening this bug to get help in triaging further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435714#comment-15435714
 ] 

Apache Spark commented on SPARK-17120:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14661

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435711#comment-15435711
 ] 

Apache Spark commented on SPARK-17099:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14661

> Incorrect result when HAVING clause is added to group by query
> --
>
> Key: SPARK-17099
> URL: https://issues.apache.org/jira/browse/SPARK-17099
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Random query generation uncovered the following query which returns incorrect 
> results when run on Spark SQL. This wasn't the original query uncovered by 
> the generator, since I performed a bit of minimization to try to make it more 
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
> (-769, -244),
> (-800, -409),
> (940, 86),
> (-507, 304),
> (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>  COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, 
> t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four 
> rows. However, if I omit the {{HAVING}} clause I see that the group's rows 
> are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +--+---+--+
> | sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2) 
>  |
> +--+---+--+
> | -507 | -1014
>  |
> | 940  | 1880 
>  |
> | -769 | -1538
>  |
> | -367 | -734 
>  |
> | -800 | -1600
>  |
> +--+---+--+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four 
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm 
> opening this bug to get help in triaging further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17099) Incorrect result when HAVING clause is added to group by query

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17099:


Assignee: Apache Spark

> Incorrect result when HAVING clause is added to group by query
> --
>
> Key: SPARK-17099
> URL: https://issues.apache.org/jira/browse/SPARK-17099
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
>
> Random query generation uncovered the following query which returns incorrect 
> results when run on Spark SQL. This wasn't the original query uncovered by 
> the generator, since I performed a bit of minimization to try to make it more 
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
> (-769, -244),
> (-800, -409),
> (940, 86),
> (-507, 304),
> (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>  ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>  COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, 
> t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four 
> rows. However, if I omit the {{HAVING}} clause I see that the group's rows 
> are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +--+---+--+
> | sum(coalesce(int_col_5, int_col_2))  | (coalesce(int_col_5, int_col_2) * 2) 
>  |
> +--+---+--+
> | -507 | -1014
>  |
> | 940  | 1880 
>  |
> | -769 | -1538
>  |
> | -367 | -734 
>  |
> | -800 | -1600
>  |
> +--+---+--+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four 
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm 
> opening this bug to get help in triaging further.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17226) Allow defining multiple date formats per column in csv

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17226:
--
Description: 
Useful to have fallbacks in case of messy input and different columns can have 
different formats.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L106

  was:Useful to have fallbacks in case of messy input and different columns can 
have different formats.


> Allow defining multiple date formats per column in csv
> --
>
> Key: SPARK-17226
> URL: https://issues.apache.org/jira/browse/SPARK-17226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Useful to have fallbacks in case of messy input and different columns can 
> have different formats.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L106



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17225) Support multiple null values in csv files

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17225:
--
Component/s: SQL

> Support multiple null values in csv files
> -
>
> Key: SPARK-17225
> URL: https://issues.apache.org/jira/browse/SPARK-17225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Since we're dealing with strings, it's useful to support multiple 
> representations of null values, as data might not be fully normalized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17224) Support skipping multiple header rows in csv

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17224:
--
Component/s: SQL

> Support skipping multiple header rows in csv
> 
>
> Key: SPARK-17224
> URL: https://issues.apache.org/jira/browse/SPARK-17224
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Headers can span multiple lines, and sometimes you want to skip several rows 
> because of the format you've been given.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17225) Support multiple null values in csv files

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17225:
--
Affects Version/s: 2.0.0

> Support multiple null values in csv files
> -
>
> Key: SPARK-17225
> URL: https://issues.apache.org/jira/browse/SPARK-17225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Since we're dealing with strings, it's useful to support multiple 
> representations of null values, as data might not be fully normalized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17226) Allow defining multiple date formats per column in csv

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17226:
--
Component/s: SQL

> Allow defining multiple date formats per column in csv
> --
>
> Key: SPARK-17226
> URL: https://issues.apache.org/jira/browse/SPARK-17226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Useful to have fallbacks in case of messy input and different columns can 
> have different formats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17224) Support skipping multiple header rows in csv

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17224:
--
Affects Version/s: 2.0.0

> Support skipping multiple header rows in csv
> 
>
> Key: SPARK-17224
> URL: https://issues.apache.org/jira/browse/SPARK-17224
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Headers can span multiple lines, and sometimes you want to skip several rows 
> because of the format you've been given.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17226) Allow defining multiple date formats per column in csv

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17226:
--
Affects Version/s: 2.0.0

> Allow defining multiple date formats per column in csv
> --
>
> Key: SPARK-17226
> URL: https://issues.apache.org/jira/browse/SPARK-17226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Useful to have fallbacks in case of messy input and different columns can 
> have different formats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17227) Allow configuring record delimiter in csv

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17227:
--
Affects Version/s: 2.0.0

> Allow configuring record delimiter in csv
> -
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Instead of the hard-coded "\n".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17227) Allow configuring record delimiter in csv

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17227:
--
Component/s: SQL

> Allow configuring record delimiter in csv
> -
>
> Key: SPARK-17227
> URL: https://issues.apache.org/jira/browse/SPARK-17227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Instead of the hard-coded "\n".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17222) Support multline csv records

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17222:
--
Affects Version/s: 2.0.0

> Support multline csv records
> 
>
> Key: SPARK-17222
> URL: https://issues.apache.org/jira/browse/SPARK-17222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>
> The input below should be read as one record, but currently it won't be, since 
> files and records are split on newlines.
> {code}
> "aaa","bb
> b","ccc"
> {code}
> This shouldn't be the default behaviour, for performance reasons, but it should 
> be configurable where necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17222) Support multline csv records

2016-08-24 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17222:
--
Component/s: SQL

> Support multline csv records
> 
>
> Key: SPARK-17222
> URL: https://issues.apache.org/jira/browse/SPARK-17222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Robert Kruszewski
>
> The input below should be read as one record, but currently it won't be, since 
> files and records are split on newlines.
> {code}
> "aaa","bb
> b","ccc"
> {code}
> This shouldn't be the default behaviour, for performance reasons, but it should 
> be configurable where necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16334) SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException

2016-08-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16334:
--
   Labels:   (was: sql)
Fix Version/s: (was: 2.0.1)
   (was: 2.1.0)
  Component/s: SQL
  Summary: SQL query on parquet table 
java.lang.ArrayIndexOutOfBoundsException  (was: [SQL] SQL query on parquet 
table java.lang.ArrayIndexOutOfBoundsException)

> SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-16334
> URL: https://issues.apache.org/jira/browse/SPARK-16334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Sameer Agarwal
>Priority: Critical
>
> Query:
> {code}
> select * from blabla where user_id = 415706251
> {code}
> Error:
> {code}
> 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 
> (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934
> at 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Works on 1.6.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17229:


Assignee: Apache Spark  (was: Josh Rosen)

> Postgres JDBC dialect should not widen float and short types during reads
> -
>
> Key: SPARK-17229
> URL: https://issues.apache.org/jira/browse/SPARK-17229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Minor
>
> When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's 
> Postgres dialect widens these types to Decimal and Integer rather than using 
> the narrower Float and Short types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads

2016-08-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435665#comment-15435665
 ] 

Apache Spark commented on SPARK-17229:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14796

> Postgres JDBC dialect should not widen float and short types during reads
> -
>
> Key: SPARK-17229
> URL: https://issues.apache.org/jira/browse/SPARK-17229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's 
> Postgres dialect widens these types to Decimal and Integer rather than using 
> the narrower Float and Short types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads

2016-08-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17229:


Assignee: Josh Rosen  (was: Apache Spark)

> Postgres JDBC dialect should not widen float and short types during reads
> -
>
> Key: SPARK-17229
> URL: https://issues.apache.org/jira/browse/SPARK-17229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's 
> Postgres dialect widens these types to Decimal and Integer rather than using 
> the narrower Float and Short types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17230) Writing decimal to csv will result empty string if the decimal exceeds (20, 18)

2016-08-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-17230:
--

 Summary: Writing decimal to csv will result empty string if the 
decimal exceeds (20, 18)
 Key: SPARK-17230
 URL: https://issues.apache.org/jira/browse/SPARK-17230
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.2
Reporter: Davies Liu
Assignee: Davies Liu


{code}
// file content 
spark.read.csv("/mnt/djiang/test-case.csv").show 
// read in as string and create temp view 
spark.read.csv("/mnt/djiang/test-case.csv").createOrReplaceTempView("test") 
// confirm schema 
spark.table("test").printSchema 
// apply decimal calculation, confirm the result is correct 
spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) from 
test").show(false) 
// run the same query, and write out as csv 
spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) from 
test").write.csv("/mnt/djiang/test-case-result") 
// show the content of the result file, particularly, for number exceeded 
decimal(20, 18), the csv is not writing anything or failing silently 
spark.read.csv("/mnt/djiang/test-case-result").show

+--+ 
| _c0| 
+--+ 
| 1| 
| 10| 
| 100| 
| 1000| 
| 1| 
|10| 
+--+

root 
|-- _c0: string (nullable = true)

+--+-+
 
|_c0 |(CAST(CAST(CAST(CAST(_c0 AS DECIMAL(20,0)) AS BIGINT) AS DECIMAL(20,0)) 
AS DECIMAL(38,18)) * CAST(CAST(1.0 AS DECIMAL(38,18)) AS DECIMAL(38,18)))| 
+--+-+
 
|1 |1.00 | 
|10 |10.00 | 
|100 |100.00 | 
|1000 |1000.00 | 
|1 |1.00 |
|10|10.00 | 
+--+-+

+--++ 
| _c0| _c1| 
+--++ 
| 1|1.00| 
| 10|10.00...| 
| 100| | 
| 1000| | 
| 1| | 
|10| | 
+--++
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-08-24 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435651#comment-15435651
 ] 

Barry Becker commented on SPARK-17219:
--

If the decision is to have an additional null/NaN bucket, then I agree that 
other choices aren't needed.
I agree that the null/NaN bucket can be separate from maxBins (i.e. request 10, 
but get 11).
A couple of other things to consider:
- I think there should always be a null/NaN bucket present, for the same reason 
that the first and last bins are -Inf and +Inf respectively. Just because there 
were no nulls in the training/fitting data does not mean that they will not 
come through later and need to be placed somewhere.
- Currently validation fails if there are fewer than 3 splits specified for a 
Bucketizer. I actually think that 2 splits should be the minimum, even though 
that means only 1 bucket! The reason is that some algorithms (like Naive Bayes) 
may choose to bin features (using MDLP discretization, for example) into just 2 
buckets: null and non-null. If we now have a null bucket always present, we may 
just want a single [-Inf, Inf] bucket for non-nulls, as strange as that sounds 
(see the Bucketizer sketch below).
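
For reference, a minimal sketch of how splits reach a Bucketizer today (the 
column names are hypothetical, and the dedicated null/NaN bucket discussed above 
does not exist yet):

{code}
import org.apache.spark.ml.feature.Bucketizer

// Splits as produced today: open-ended sentinel bins at both ends, but
// nowhere for NaN values to land.
val splits = Array(Double.NegativeInfinity, 15.0, 20.5, 24.0, 28.0,
  32.5, 38.0, 48.0, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setInputCol("age")        // hypothetical input column
  .setOutputCol("ageBucket") // hypothetical output column
  .setSplits(splits)

// The two-split minimum argued for above would be the degenerate case:
// Array(Double.NegativeInfinity, Double.PositiveInfinity), i.e. one bucket.
{code}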


> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though, in case future data has nulls even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads

2016-08-24 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17229:
--

 Summary: Postgres JDBC dialect should not widen float and short 
types during reads
 Key: SPARK-17229
 URL: https://issues.apache.org/jira/browse/SPARK-17229
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Minor


When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's 
Postgres dialect widens these types to Decimal and Integer rather than using 
the narrower Float and Short types.
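
A rough sketch of the kind of mapping change this implies, written as a 
standalone dialect (illustrative only; the actual fix would go into the 
built-in PostgresDialect):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

// Map Postgres FLOAT4 and SMALLINT to the narrower Catalyst types instead of
// widening them to Decimal and Integer.
object NarrowingPostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.REAL     => Some(FloatType)
      case Types.SMALLINT => Some(ShortType)
      case _              => None
    }
}

// A custom dialect like this could be registered for experimentation via:
// JdbcDialects.registerDialect(NarrowingPostgresDialect)
{code}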



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   3   >