[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436335#comment-15436335 ] DB Tsai commented on SPARK-17163: - I voted for merging into one interface as well. Since binary LOR can be represented as a matrix just like MLOR, we can always return a coefficient matrix and intercepts for both BLOR and MLOR. For BLOR, I feel like flattening the matrix and setting the intercept to zero is too hacky, and we could just throw an exception instead. Finally, we can make pivoting the default for two-class problems, and use MLOR without pivoting when there are more than two classes. I like what Yanbo suggested: we can default to auto, and users can explicitly set it to binomial or multinomial. Thanks. > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous, since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
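The pivoting equivalence discussed in this comment can be illustrated outside Spark: for a two-class problem, a softmax model whose first class is pivoted to a zero margin produces exactly the sigmoid probabilities of binary LOR. A minimal Python sketch of that equivalence (illustrative numbers only, not Spark code):

```python
import math

def sigmoid(m):
    # Binary logistic regression: probability of the positive class.
    return 1.0 / (1.0 + math.exp(-m))

def softmax(margins):
    # Multinomial (softmax) probabilities for a vector of class margins.
    exps = [math.exp(m) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

# Pivoting the first class to margin 0 makes the softmax probability
# of the second class identical to the sigmoid of the margin m.
m = 1.7  # arbitrary example margin
p_binary = sigmoid(m)
p_multi = softmax([0.0, m])[1]
assert abs(p_binary - p_multi) < 1e-12
```

This is why a unified interface can treat a binomial model as a pivoted two-class multinomial model without changing its predictions.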
[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436326#comment-15436326 ] Seth Hendrickson commented on SPARK-17163: -- Good catch, thanks! > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous, since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1.
[jira] [Commented] (SPARK-17201) Investigate numerical instability for MLOR without regularization
[ https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436318#comment-15436318 ] DB Tsai commented on SPARK-17201: - This makes sense. Let's keep an eye on this, and figure out the interface first. Patching this with pivoting is relatively easy, and can be done in a way that the model format is not changed, by unpivoting the coefficients and centering them again. Thanks. > Investigate numerical instability for MLOR without regularization > - > > Key: SPARK-17201 > URL: https://issues.apache.org/jira/browse/SPARK-17201 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > As mentioned > [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no > regularization is applied in Softmax regression, second-order Newton solvers > may run into numerical instability problems. We should investigate this in > practice and find a solution, possibly by implementing pivoting when no > regularization is applied.
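The "unpivoting and centering" idea in this comment rests on the fact that softmax probabilities are invariant to adding a constant to every class margin, which is also the source of the non-identifiability that destabilizes unregularized solvers. A minimal Python sketch of that invariance (illustrative values, not the Spark implementation):

```python
import math

def softmax(margins):
    # Subtract the max margin before exponentiating for numerical stability.
    mx = max(margins)
    exps = [math.exp(m - mx) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

# A pivoted 3-class model: the first class's margin is fixed at zero.
pivoted = [0.0, 1.2, -0.4]

# "Unpivot" by centering: subtract the mean margin from every class.
mean = sum(pivoted) / len(pivoted)
centered = [m - mean for m in pivoted]

# The constant shift cancels in the softmax, so predictions are unchanged,
# which is why the model format need not change.
for p, c in zip(softmax(pivoted), softmax(centered)):
    assert abs(p - c) < 1e-12
```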
[jira] [Issue Comment Deleted] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name
[ https://issues.apache.org/jira/browse/SPARK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagadeesan A S updated SPARK-17232: --- Comment: was deleted (was: I'm not able to reproduce the issue. {code:xml} scala> readDf.rdd res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at rdd at :26 scala> readDf.show() +---+---+ |a.b|a.c| +---+---+ | 1| 2| +---+---+ {code} ) > Expecting same behavior after loading a dataframe with dots in column name > -- > > Key: SPARK-17232 > URL: https://issues.apache.org/jira/browse/SPARK-17232 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Louis Salin > > In Spark 2.0, the behavior of a dataframe changes after saving and reloading > it when there are dots in the column names. In the example below, I was able > to call the {{rdd}} function for a newly created dataframe. However, after > saving it and reloading it, an exception gets thrown when calling the {{rdd}} > function. > from a spark-shell: > {{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}} > Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\] > {{scala> simpleDf.rdd}} > Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = > MapPartitionsRDD\[7\] at rdd at :29 > {{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}} > {{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}} > Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\] > {{scala> readDf.rdd}} > {noformat} > org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, > a.c]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > 
at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:95) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61) > at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48) > at >
[jira] [Commented] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name
[ https://issues.apache.org/jira/browse/SPARK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436289#comment-15436289 ] Jagadeesan A S commented on SPARK-17232: I'm not able to reproduce the issue. {code:xml} scala> readDf.rdd res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at rdd at :26 scala> readDf.show() +---+---+ |a.b|a.c| +---+---+ | 1| 2| +---+---+ {code} > Expecting same behavior after loading a dataframe with dots in column name > -- > > Key: SPARK-17232 > URL: https://issues.apache.org/jira/browse/SPARK-17232 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Louis Salin > > In Spark 2.0, the behavior of a dataframe changes after saving and reloading > it when there are dots in the column names. In the example below, I was able > to call the {{rdd}} function for a newly created dataframe. However, after > saving it and reloading it, an exception gets thrown when calling the {{rdd}} > function. > from a spark-shell: > {{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}} > Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\] > {{scala> simpleDf.rdd}} > Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = > MapPartitionsRDD\[7\] at rdd at :29 > {{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}} > {{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}} > Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\] > {{scala> readDf.rdd}} > {noformat} > org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, > a.c]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) 
> at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:95) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61) > at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48) > at >
[jira] [Comment Edited] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name
[ https://issues.apache.org/jira/browse/SPARK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436285#comment-15436285 ] Jagadeesan A S edited comment on SPARK-17232 at 8/25/16 5:01 AM: - I'm not able to reproduce the issue. {code:xml} scala> readDf.rdd res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at rdd at :26 scala> readDf.show() +---+---+ |a.b|a.c| +---+---+ | 1| 2| +---+---+ {code} was (Author: as2): I'm not able to reproduce the issue. {code:xml} scala> readDf.rdd res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at rdd at :26 {code} > Expecting same behavior after loading a dataframe with dots in column name > -- > > Key: SPARK-17232 > URL: https://issues.apache.org/jira/browse/SPARK-17232 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Louis Salin > > In Spark 2.0, the behavior of a dataframe changes after saving and reloading > it when there are dots in the column names. In the example below, I was able > to call the {{rdd}} function for a newly created dataframe. However, after > saving it and reloading it, an exception gets thrown when calling the {{rdd}} > function. 
> from a spark-shell: > {{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}} > Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\] > {{scala> simpleDf.rdd}} > Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = > MapPartitionsRDD\[7\] at rdd at :29 > {{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}} > {{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}} > Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\] > {{scala> readDf.rdd}} > {noformat} > org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, > a.c]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:95) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87) > at > 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61) > at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
[jira] [Commented] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name
[ https://issues.apache.org/jira/browse/SPARK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436285#comment-15436285 ] Jagadeesan A S commented on SPARK-17232: I'm not able to reproduce the issue. {code:xml} scala> readDf.rdd res2: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at rdd at :26 {code} > Expecting same behavior after loading a dataframe with dots in column name > -- > > Key: SPARK-17232 > URL: https://issues.apache.org/jira/browse/SPARK-17232 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Louis Salin > > In Spark 2.0, the behavior of a dataframe changes after saving and reloading > it when there are dots in the column names. In the example below, I was able > to call the {{rdd}} function for a newly created dataframe. However, after > saving it and reloading it, an exception gets thrown when calling the {{rdd}} > function. > from a spark-shell: > {{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}} > Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\] > {{scala> simpleDf.rdd}} > Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = > MapPartitionsRDD\[7\] at rdd at :29 > {{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}} > {{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}} > Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\] > {{scala> readDf.rdd}} > {noformat} > org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, > a.c]; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:95) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129) > at > org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61) > at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48) > at > org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48) > at
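The underlying ambiguity in this bug report is that a dotted name like a.b can denote either a literal column name or a field b nested inside a struct column a, and the plan's name resolver must pick one interpretation. A toy resolver in Python (not Spark's actual logic; just an illustration of the two readings, with the bug being that the post-reload path apparently fails to try the exact-match reading):

```python
def resolve(name, columns):
    # Toy name resolver: prefer an exact match on the literal column
    # name, then fall back to the struct-field interpretation.
    if name in columns:
        return ("column", name)
    if "." in name:
        head, field = name.split(".", 1)
        if head in columns:
            return ("field", head, field)
    raise ValueError(f"Unable to resolve {name} given {sorted(columns)}")

# With a literal column named "a.b" present, the exact match wins.
assert resolve("a.b", {"a.b", "a.c"}) == ("column", "a.b")

# With only a struct column "a", the same name means field access.
assert resolve("a.b", {"a"}) == ("field", "a", "b")
```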
[jira] [Updated] (SPARK-17190) Removal of HiveSharedState
[ https://issues.apache.org/jira/browse/SPARK-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-17190: Assignee: Xiao Li > Removal of HiveSharedState > -- > > Key: SPARK-17190 > URL: https://issues.apache.org/jira/browse/SPARK-17190 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Since `HiveClient` is used to interact with the Hive metastore, it should be > hidden in `HiveExternalCatalog`. After moving `HiveClient` into > `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of > `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes > straightforward. After removal of `HiveSharedState`, the reflection logic is > directly applied on the choice of `ExternalCatalog` types, based on the > configuration of `CATALOG_IMPLEMENTATION`. > Since `HiveClient` is also used by other entities besides > `HiveExternalCatalog`, we define the following two APIs: > {noformat} > /** >* Return the existing [[HiveClient]] used to interact with the metastore. >*/ > def getClient: HiveClient > /** >* Return a [[HiveClient]] as a new session >*/ > def getNewClient: HiveClient > {noformat}
[jira] [Resolved] (SPARK-17190) Removal of HiveSharedState
[ https://issues.apache.org/jira/browse/SPARK-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17190. - Issue resolved by pull request 14757 [https://github.com/apache/spark/pull/14757] > Removal of HiveSharedState > -- > > Key: SPARK-17190 > URL: https://issues.apache.org/jira/browse/SPARK-17190 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.1.0 > > > Since `HiveClient` is used to interact with the Hive metastore, it should be > hidden in `HiveExternalCatalog`. After moving `HiveClient` into > `HiveExternalCatalog`, `HiveSharedState` becomes a wrapper of > `HiveExternalCatalog`. Thus, removal of `HiveSharedState` becomes > straightforward. After removal of `HiveSharedState`, the reflection logic is > directly applied on the choice of `ExternalCatalog` types, based on the > configuration of `CATALOG_IMPLEMENTATION`. > Since `HiveClient` is also used by other entities besides > `HiveExternalCatalog`, we define the following two APIs: > {noformat} > /** >* Return the existing [[HiveClient]] used to interact with the metastore. >*/ > def getClient: HiveClient > /** >* Return a [[HiveClient]] as a new session >*/ > def getNewClient: HiveClient > {noformat}
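The "reflection logic directly applied on the choice of `ExternalCatalog` types" can be sketched generically: a configuration value names the catalog implementation, and the matching class is instantiated by name. A hypothetical Python analogue (the class names and the lookup-table stand-in for reflective class loading are illustrative, not Spark's actual code):

```python
class InMemoryCatalog:
    name = "in-memory"

class HiveExternalCatalog:
    # In the proposed design this catalog owns the HiveClient directly,
    # which is what makes the HiveSharedState wrapper unnecessary.
    name = "hive"

# Maps a CATALOG_IMPLEMENTATION-style config value to a class, standing
# in for loading the implementation class reflectively by name.
_CATALOG_IMPLEMENTATIONS = {
    "in-memory": InMemoryCatalog,
    "hive": HiveExternalCatalog,
}

def external_catalog(conf):
    # Hypothetical config key modeled on Spark's catalog-implementation setting.
    impl = conf.get("spark.sql.catalogImplementation", "in-memory")
    return _CATALOG_IMPLEMENTATIONS[impl]()
```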
[jira] [Updated] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-14381: Fix Version/s: (was: 2.1.0) > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself.
[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436268#comment-15436268 ] Yanbo Liang commented on SPARK-14381: - Resolved this, thanks for working on it. > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself.
[jira] [Resolved] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-14381. - Resolution: Done Assignee: Xusen Yin Fix Version/s: 2.1.0 > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > Fix For: 2.1.0 > > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself.
[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:30 AM: -- * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** PMML SPARK-11239 ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 * StreamingLinearRegression was (Author: yanboliang): * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** PMML SPARK-11237 ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 * StreamingLinearRegression > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself.
[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:29 AM: -- * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** PMML SPARK-11237 ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 * StreamingLinearRegression was (Author: yanboliang): * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself.
[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:26 AM: -- * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary * IsotonicRegressionModel ** single-row prediction SPARK-10413 was (Author: yanboliang): * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary SPARK-14712 * IsotonicRegressionModel ** single-row prediction SPARK-10413 > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-14378: --- Assignee: Yanbo Liang > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang edited comment on SPARK-14378 at 8/25/16 4:25 AM: -- * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel (Hold off until SPARK-10780 resolved) ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary SPARK-14712 * IsotonicRegressionModel ** single-row prediction SPARK-10413 was (Author: yanboliang): * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary SPARK-14712 * IsotonicRegressionModel ** single-row prediction SPARK-10413 > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17228) Not infer/propagate non-deterministic constraints
[ https://issues.apache.org/jira/browse/SPARK-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17228. - Resolution: Fixed Assignee: Sameer Agarwal Fix Version/s: 2.1.0 2.0.1 > Not infer/propagate non-deterministic constraints > - > > Key: SPARK-17228 > URL: https://issues.apache.org/jira/browse/SPARK-17228 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14378) Review spark.ml parity for regression, except trees
[ https://issues.apache.org/jira/browse/SPARK-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436264#comment-15436264 ] Yanbo Liang commented on SPARK-14378: - * GeneralizedLinearModel(LinearRegressionModel/RidgeRegressionModel/LassoRegressionModel) ** single-row prediction SPARK-10413 ** initialModel ** setValidateData (not important for public API) ** optimizer interface SPARK-17136 ** LBFGS set numCorrections (not important for public API) ** toString: print summary SPARK-14712 * IsotonicRegressionModel ** single-row prediction SPARK-10413 > Review spark.ml parity for regression, except trees > --- > > Key: SPARK-14378 > URL: https://issues.apache.org/jira/browse/SPARK-14378 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17066) dateFormat should be used when writing dataframes as csv files
[ https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17066: Fix Version/s: 2.1.0 2.0.1 > dateFormat should be used when writing dataframes as csv files > -- > > Key: SPARK-17066 > URL: https://issues.apache.org/jira/browse/SPARK-17066 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > Fix For: 2.0.1, 2.1.0 > > > I noticed this when running tests after pulling and building @lw-lin 's PR > (https://github.com/apache/spark/pull/14118). I don't think it is anything > wrong with his PR, just that the fix that was made to spark-csv for this > issue was never moved to Spark 2.x when Databricks' spark-csv was merged into > Spark 2 back in January. https://github.com/databricks/spark-csv/issues/308 > was fixed in spark-csv after that merge. > The problem is that if I try to write a dataframe that contains a date column > out to a csv using something like this > repartitionDf.write.format("csv") //.format(DATABRICKS_CSV) > .option("delimiter", "\t") > .option("header", "false") > .option("nullValue", "?") > .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss") > .option("escape", "\\") > .save(tempFileName) > Then my unit test (which passed under Spark 1.6.2) fails using the Spark > 2.1.0 snapshot build that I made today. The dataframe contained 3 values in a > date column. > Expected "[2012-01-03T09:12:00 > ? > 2015-02-23T18:00:]00", > but got > "[132561072000 > ? > 1424743200]00" > This means that while the null value is being correctly exported, the > specified dateFormat is not being used to format the date. Instead it looks > like the number of seconds from epoch is being used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
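As a rough illustration of the mismatch described in the report (plain Python rather than Spark code, with made-up function names): the test expected a cell formatted with the configured pattern, but the writer emitted the raw epoch value instead.

```python
from datetime import datetime, timezone

def format_cell(epoch_seconds: int, fmt: str = "%Y-%m-%dT%H:%M:%S") -> str:
    # What a dateFormat-respecting writer would produce for a date cell.
    # "%Y-%m-%dT%H:%M:%S" is the strftime equivalent of the SimpleDateFormat
    # pattern "yyyy-MM-dd'T'HH:mm:ss" used in the report.
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime(fmt)

def buggy_cell(epoch_seconds: int) -> str:
    # What the report says Spark 2.x actually produced: the raw number.
    return str(epoch_seconds)

secs = 1325581920  # 2012-01-03 09:12:00 UTC
print(format_cell(secs))  # 2012-01-03T09:12:00
print(buggy_cell(secs))   # 1325581920
```

The exact epoch values in the report are garbled in this archive, so the timestamp above is chosen only to match the first expected cell.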
[jira] [Updated] (SPARK-16597) DataFrame DateType is written as an int(Days since epoch) by csv writer
[ https://issues.apache.org/jira/browse/SPARK-16597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16597: Fix Version/s: 2.1.0 2.0.1 > DataFrame DateType is written as an int(Days since epoch) by csv writer > --- > > Key: SPARK-16597 > URL: https://issues.apache.org/jira/browse/SPARK-16597 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Dean Chen > Labels: csv > Fix For: 2.0.1, 2.1.0 > > > import java.sql.Date > case class DateClass(date: java.sql.Date) > val df = spark.createDataFrame(Seq(DateClass(new Date(1468774636000L)))) > df.write.csv("test.csv") > file content is 16999 (days since epoch) instead of 7/17/16 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
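The reported file content checks out. In plain Python (a sketch for decoding the value, not Spark code), 16999 days after the Unix epoch is indeed 2016-07-17, the date wrapped by `new Date(1468774636000L)` above:

```python
from datetime import date, timedelta

def days_since_epoch_to_date(days: int) -> date:
    # Decode what the CSV writer actually emitted: an integer count of
    # days since the Unix epoch (1970-01-01).
    return date(1970, 1, 1) + timedelta(days=days)

print(days_since_epoch_to_date(16999))  # 2016-07-17
# Sanity check: the epoch-millis literal from the snippet falls on day 16999.
print(1468774636000 // 86_400_000)      # 16999
```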
[jira] [Resolved] (SPARK-16216) CSV data source does not write date and timestamp correctly
[ https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16216. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > CSV data source does not write date and timestamp correctly > --- > > Key: SPARK-16216 > URL: https://issues.apache.org/jira/browse/SPARK-16216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Blocker > Labels: releasenotes > Fix For: 2.0.1, 2.1.0 > > > Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below: > {code} > ++ > |date| > ++ > |14406372| > |14144598| > |14540400| > ++ > {code} > It would be nicer if it wrote dates and timestamps as formatted strings, just > like the JSON data source. > Also, the CSV data source currently supports a {{dateFormat}} option to read dates > and timestamps in a custom format. It might be better if this option could be > applied when writing as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17204: --- Description: We use the OFF_HEAP storage level extensively with great success. We've tried off-heap storage with replication factor 2 and have always received exceptions on the executor side very shortly after starting the job. For example: {code} com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086 at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} or {code} java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436222#comment-15436222 ] Saisai Shao commented on SPARK-17204: - Yes, I could reproduce this issue, but not constantly, sometimes it is OK without any exception. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively with great success. We've tried > off-heap storage with replication factor 2 and have always received > exceptions on the executor side very shortly after starting the job. For > example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at >
[jira] [Comment Edited] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436218#comment-15436218 ] Michael Allman edited comment on SPARK-17204 at 8/25/16 3:36 AM: - I would think that, but `sc.range(0, 0)` throws the exception, too. Are you able to reproduce the problem with `sc.range(0, 2)`? was (Author: michael): I would think that, but `sc.range(0, 0)` throws the exception, too. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively with great success. We've tried > off-heap storage with replication factor 2 and have always received > exceptions on the executor side very shortly after starting the job. For > example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) >
[jira] [Comment Edited] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436218#comment-15436218 ] Michael Allman edited comment on SPARK-17204 at 8/25/16 3:37 AM: - I would think that, but {{sc.range(0, 0)}} throws the exception, too. Are you able to reproduce the problem with {{sc.range(0, 2)}}? was (Author: michael): I would think that, but `sc.range(0, 0)` throws the exception, too. Are you able to reproduce the problem with `sc.range(0, 2)`? > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively with great success. We've tried > off-heap storage with replication factor 2 and have always received > exceptions on the executor side very shortly after starting the job. 
For > example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at >
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436218#comment-15436218 ] Michael Allman commented on SPARK-17204: I would think that, but `sc.range(0, 0)` throws the exception, too. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively with great success. We've tried > off-heap storage with replication factor 2 and have always received > exceptions on the executor side very shortly after starting the job. For > example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at >
[jira] [Updated] (SPARK-17233) Shuffle file will be left over the capacity when dynamic schedule is enabled in a long running case.
[ https://issues.apache.org/jira/browse/SPARK-17233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] carlmartin updated SPARK-17233: --- Description: When I execute some SQL statements periodically in a long-running thriftserver, the disk device fills up after about one week. After checking the files on Linux, I found many shuffle files left in the block-mgr dir whose shuffle stages had finished long ago. Finally I found that when shuffle files need to be cleaned, the driver tells each executor to do the ShuffleClean. But when dynamic scheduling is enabled, executors may already have been shut down and so can't clean their own shuffle files, leaving the files behind. I tested this on Spark 1.5, but the master branch must have this issue as well. was: When I execute some SQL statements periodically in a long-running thriftserver, the disk device fills up after about one week. After checking the files on Linux, I found many shuffle files left in the block-mgr dir whose shuffle stages had finished long ago. Finally I found that when shuffle files need to be cleaned, the driver tells each executor to do the ShuffleClean. But when dynamic scheduling is enabled, executors may already have been shut down and so can't clean their own shuffle files, leaving the files behind. > Shuffle file will be left over the capacity when dynamic schedule is enabled > in a long running case. > > > Key: SPARK-17233 > URL: https://issues.apache.org/jira/browse/SPARK-17233 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2, 1.6.2, 2.0.0 >Reporter: carlmartin > > When I execute some SQL statements periodically in a long-running > thriftserver, the disk device fills up after about one week. > After checking the files on Linux, I found many shuffle files left in the > block-mgr dir whose shuffle stages had finished long ago. > Finally I found that when shuffle files need to be cleaned, the driver tells > each executor to do the ShuffleClean. But when dynamic scheduling is enabled, > executors may already have been shut down and so can't clean their own shuffle > files, leaving the files behind. > I tested this on Spark 1.5, but the master branch must have this issue as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
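The cleanup race described above can be sketched as a toy simulation (hypothetical names; this is not Spark's actual ShuffleClean code): the driver can only ask *live* executors to delete their shuffle files, so files owned by executors that dynamic allocation has already removed are never cleaned.

```python
class Cluster:
    """Toy model of driver-triggered shuffle cleanup under dynamic allocation."""

    def __init__(self):
        self.live_executors = set()
        self.disk_files = {}  # executor id -> set of shuffle file names on its disk

    def add_executor(self, exec_id):
        self.live_executors.add(exec_id)
        self.disk_files.setdefault(exec_id, set())

    def write_shuffle(self, exec_id, file_name):
        self.disk_files[exec_id].add(file_name)

    def remove_executor(self, exec_id):
        # Dynamic allocation tears the idle executor down, but its local
        # shuffle files stay on disk.
        self.live_executors.discard(exec_id)

    def clean_shuffle(self):
        # Driver-side cleanup reaches only executors that are still alive.
        for exec_id in self.live_executors:
            self.disk_files[exec_id].clear()

    def leaked_files(self):
        return {f for exec_id, files in self.disk_files.items()
                if exec_id not in self.live_executors for f in files}


cluster = Cluster()
cluster.add_executor("exec-1")
cluster.add_executor("exec-2")
cluster.write_shuffle("exec-1", "shuffle_0_0_0.data")
cluster.write_shuffle("exec-2", "shuffle_0_1_0.data")
cluster.remove_executor("exec-2")  # idle executor removed by dynamic allocation
cluster.clean_shuffle()            # stage finished, driver triggers cleanup

print(cluster.leaked_files())      # exec-2's file was never deleted
```

This reproduces the report's failure mode: the cleanup message never reaches a decommissioned executor, so its files accumulate until the disk fills.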
[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17204: --- Description: We use the OFF_HEAP storage level extensively with great success. We've tried off-heap storage with replication factor 2 and have always received exceptions on the executor side very shortly after starting the job. For example: {code} com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086 at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} or {code} java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436213#comment-15436213 ] Saisai Shao commented on SPARK-17204: - I think to reflect the issue {{sc.range(0, 0)}} should be changed to {{sc.range(0, 2)}}, {{range(0, 0)}} actually persist nothing to memory. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively with great success. We've tried > off-heap storage with replication factor 2 and have always received > exceptions on the executor side very shortly after starting the job. For > example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > 
Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at >
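Saisai Shao's suggestion can be illustrated with plain Python, since `sc.range(start, end)` uses the same half-open interval convention as Python's built-in `range` (the real reproduction would persist the RDD with OFF_HEAP replication on an actual cluster, which a local snippet cannot exercise):

```python
# Why `sc.range(0, 0)` cannot exercise replication: the half-open interval
# [0, 0) is empty, so persisting it stores no blocks at all, while [0, 2)
# produces elements that would actually be cached and replicated.
empty = list(range(0, 0))
non_empty = list(range(0, 2))

print(empty)      # []
print(non_empty)  # [0, 1]
```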
[jira] [Created] (SPARK-17233) Shuffle file will be left over the capacity when dynamic schedule is enabled in a long running case.
carlmartin created SPARK-17233: -- Summary: Shuffle file will be left over the capacity when dynamic schedule is enabled in a long running case. Key: SPARK-17233 URL: https://issues.apache.org/jira/browse/SPARK-17233 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0, 1.6.2, 1.5.2 Reporter: carlmartin When I execute some SQL statements periodically in a long-running thriftserver, the disk device fills up after about one week. After checking the files on Linux, I found many shuffle files left in the block-mgr dir whose shuffle stages had finished long ago. Finally I found that when shuffle files need to be cleaned, the driver tells each executor to do the ShuffleClean. But when dynamic scheduling is enabled, executors may already have been shut down and so can't clean their own shuffle files, leaving the files behind. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436205#comment-15436205 ] Saisai Shao commented on SPARK-17204: - No, I tested in yarn cluster, not local mode. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively. We've tried off-heap storage > with replication factor 2 and have always received exceptions on the executor > side very shortly after starting the job. For example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > 
at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191 ] Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:22 AM: -- Exposing a {{family}} or similar parameter sounds good to me. One question: {quote} When the family is set to "binomial" we produce normal logistic regression with pivoting and when it is set to "multinomial" (default) it produces logistic regression with pivoting. {quote} Should it be {{when it is set to "multinomial" (default) it produces logistic regression {color:red}without{color} pivoting}} ? Thanks! was (Author: yanboliang): Exposing a {{family}} or similar parameter sounds good to me. > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and, is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191 ] Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:22 AM: -- Exposing a {{family}} or similar parameter sounds good to me. One more question: {quote} When the family is set to "binomial" we produce normal logistic regression with pivoting and when it is set to "multinomial" (default) it produces logistic regression with pivoting. {quote} Should it be {{when it is set to "multinomial" (default) it produces logistic regression {color:red}without{color} pivoting}} ? Thanks! was (Author: yanboliang): Exposing a {{family}} or similar parameter sounds good to me. One question: {quote} When the family is set to "binomial" we produce normal logistic regression with pivoting and when it is set to "multinomial" (default) it produces logistic regression with pivoting. {quote} Should it be {{when it is set to "multinomial" (default) it produces logistic regression {color:red}without{color} pivoting}} ? Thanks! > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and, is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
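Yanbo's question hinges on what "pivoting" means: fixing one class's margin at zero so only K-1 coefficient vectors are estimated. A small sketch (plain Python with illustrative values, not Spark code) shows that for two classes the pivoted multinomial softmax is exactly the binomial sigmoid:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case "with pivoting": class 0 is the pivot, its margin is fixed at 0,
# so the model keeps a single coefficient vector and P(y=1) is a sigmoid.
z = 1.7  # margin w.x + b for class 1 (illustrative value)
p_binomial = sigmoid(z)

# "Without pivoting", the multinomial form keeps one margin per class;
# pivoting is just forcing the class-0 margin to zero.
p_multinomial = softmax([0.0, z])[1]

print(abs(p_binomial - p_multinomial) < 1e-9)
```

This also shows why the unpivoted parameterization is redundant: adding the same constant to every class margin leaves the softmax unchanged, which is one reason regularized MLOR and BLOR can reach different solutions on the same binary problem.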
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436202#comment-15436202 ] Michael Allman commented on SPARK-17204: Hi [~jerryshao]. I wonder if you're testing in local mode? I only see this problem when running with remote executors on a cluster. When I run in local mode, I see a bunch of warnings like: {code} ... 16/08/25 03:13:55 WARN storage.BlockManager: Block rdd_3_8 replicated to only 0 peer(s) instead of 1 peers 16/08/25 03:13:55 WARN storage.BlockManager: Block rdd_3_15 replicated to only 0 peer(s) instead of 1 peers 16/08/25 03:13:55 WARN storage.BlockManager: Block rdd_3_9 replicated to only 0 peer(s) instead of 1 peers ... {code} These messages suggest to me no actual replication is being attempted, and that's why the problem is not manifested. To answer your other question, the test case I provided was something very simple I came up with after discovering this problem. My coworker was reading data from parquet files when I cut-n-pasted those stack traces. I'll clarify these points in the description. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively. We've tried off-heap storage > with replication factor 2 and have always received exceptions on the executor > side very shortly after starting the job. 
For example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at >
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191 ] Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:14 AM: -- Exposing a {{family}} or similar parameter sounds good to me. was (Author: yanboliang): Exposing a {{family}} or similar parameter to control pivoting sounds good to me. > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and, is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434412#comment-15434412 ] Yanbo Liang edited comment on SPARK-17163 at 8/25/16 3:12 AM: -- I think it's hard to unify binary and multinomial logistic regression if we do not make any breaking changes. * Like [~sethah] said, we need to find a way to unify the representation of {{coefficients}} and {{intercept}}. I think flattening the matrix into a vector is still a compromise; the best representation would be a matrix for {{coefficients}} and a vector for {{intercept}}, even for a binary classification problem. This would be consistent with other ML models such as {{NaiveBayesModel}}, which also supports multi-class classification. But it would introduce a breaking change. * MLOR and LOR return different results for binary classification when regularization is used. * The current LOR code base provides both {{setThreshold}} and {{setThresholds}} for binary logistic regression, and they have some interactions. If we make MLOR and LOR share the old LOR code base, that will also introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543. * Model store/load compatibility. Here we have two choices: consolidate them, which will introduce a breaking change, or keep them separate. -I would prefer to keep LOR and MLOR as different APIs, but I do not hold this opinion very strongly if you have a better proposal. Thanks!- was (Author: yanboliang): I think it's hard to unify binary and multinomial logistic regression if we do not make any breaking changes. * Like [~sethah] said, we need to find a way to unify the representation of {{coefficients}} and {{intercept}}. I think flattening the matrix into a vector is still a compromise; the best representation would be a matrix for {{coefficients}} and a vector for {{intercept}}, even for a binary classification problem. This would be consistent with other ML models such as {{NaiveBayesModel}}, which also supports multi-class classification. But it would introduce a big breaking change. * MLOR and LOR return different results for binary classification when regularization is used. * The current LOR code base provides both {{setThreshold}} and {{setThresholds}} for binary logistic regression, and they have some interactions. If we make MLOR and LOR share the old LOR code base, that will also introduce breaking changes for these APIs. FYI: SPARK-11834 and SPARK-11543. * Model store/load compatibility. Here we have two choices: consolidate them, which will introduce a breaking change, or keep them separate. -I would prefer to keep LOR and MLOR as different APIs, but I do not hold this opinion very strongly if you have a better proposal. Thanks!- > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
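The pivoting point in this thread is easy to make concrete: with two classes and one class's margin pivoted to zero, the softmax reduces exactly to the sigmoid, which is why MLOR subsumes BLOR; an L2 penalty applied to two coefficient rows is not the same as one applied to their pivoted difference, however, which is why the regularized results differ. A quick numeric check of the first claim (plain Python, not Spark code):

```python
import math

def sigmoid(z):
    # Binary LOR link.
    return 1.0 / (1.0 + math.exp(-z))

def softmax(margins):
    # Multinomial LOR link, with the usual max-subtraction for stability.
    m = max(margins)
    exps = [math.exp(z - m) for z in margins]
    total = sum(exps)
    return [e / total for e in exps]

# With K = 2 and the first class's margin pivoted to zero, the softmax
# probability of the second class is exactly the sigmoid of its margin.
z = 1.7
assert abs(softmax([0.0, z])[1] - sigmoid(z)) < 1e-12
```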
[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436191#comment-15436191 ] Yanbo Liang commented on SPARK-17163: - Exposing a {{family}} or similar parameter to control pivoting sounds good to me. > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436179#comment-15436179 ] Saisai Shao commented on SPARK-17204: - It works OK in my local test with the latest build: {code} val OFF_HEAP_2 = StorageLevel(useDisk = true, useMemory = true, useOffHeap = true, deserialized = false, replication = 2) sc.range(0, 0).persist(OFF_HEAP_2).count {code} Also, I'm curious why SparkSQL-related code is involved according to the exception you pasted above; are you using {{SparkSession#range}} instead? I also tested Dataset persistence with {{OFF_HEAP_2}}; it works fine without exceptions. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively. We've tried off-heap storage > with replication factor 2 and have always received exceptions on the executor > side very shortly after starting the job. 
For example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at >
[jira] [Commented] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled
[ https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436172#comment-15436172 ] Apache Spark commented on SPARK-15382: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/14800 > monotonicallyIncreasingId doesn't work when data is upsampled > - > > Key: SPARK-15382 > URL: https://issues.apache.org/jira/browse/SPARK-15382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Mateusz Buśkiewicz > Fix For: 2.0.1, 2.1.0 > > > Assigned ids are not unique > {code} > from pyspark.sql import Row > from pyspark.sql.functions import monotonicallyIncreasingId > hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, > 10.0).withColumn('id', monotonicallyIncreasingId()).collect() > {code} > Output: > {code} > [Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled
[ https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436171#comment-15436171 ] Takeshi Yamamuro commented on SPARK-15382: -- Sorry, but the master still has this bug. I made a pr, so could you check this? > monotonicallyIncreasingId doesn't work when data is upsampled > - > > Key: SPARK-15382 > URL: https://issues.apache.org/jira/browse/SPARK-15382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Mateusz Buśkiewicz > Fix For: 2.0.1, 2.1.0 > > > Assigned ids are not unique > {code} > from pyspark.sql import Row > from pyspark.sql.functions import monotonicallyIncreasingId > hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, > 10.0).withColumn('id', monotonicallyIncreasingId()).collect() > {code} > Output: > {code} > [Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
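The duplicated ids in the output above are easier to interpret given the documented layout of {{monotonicallyIncreasingId}}: the upper 31 bits hold the partition id and the lower 33 bits a per-partition record counter. Decoding the values from the ticket (plain Python; the bit layout is the documented one, while the reading of these particular values is an inference):

```python
# monotonically_increasing_id is documented to pack the partition id
# into the upper 31 bits and a per-partition counter into the lower 33.
def decode(mid):
    return mid >> 33, mid & ((1 << 33) - 1)

# Every duplicated id from the ticket decodes to offset 0 of some
# partition, i.e. each copy of a row received the same
# (partition, offset) pair instead of distinct counter values.
assert decode(429496729600) == (50, 0)
assert decode(867583393792) == (101, 0)
```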
[jira] [Commented] (SPARK-17226) Allow defining multiple date formats per column in csv
[ https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436158#comment-15436158 ] Hyukjin Kwon commented on SPARK-17226: -- Maybe add code to reproduce this and a suggestion, rather than just pointing to the code? I guess it is arguable how to deal with this. There is actually already an issue about this in the external CSV library: https://github.com/databricks/spark-csv/pull/359 > Allow defining multiple date formats per column in csv > -- > > Key: SPARK-17226 > URL: https://issues.apache.org/jira/browse/SPARK-17226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Useful to have fallbacks in case of messy input, and different columns can > have different formats. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L106 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
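A per-column format list with fallbacks, as proposed in this ticket, might behave like the following sketch. This is illustrative Python using stdlib {{strptime}}, not Spark's parser; the function name is invented.

```python
from datetime import datetime

def parse_date(value, formats):
    """Try each configured format in order, falling back to the next
    on failure; raise only when no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError("%r matches none of %r" % (value, formats))
```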
[jira] [Commented] (SPARK-17222) Support multline csv records
[ https://issues.apache.org/jira/browse/SPARK-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436154#comment-15436154 ] Hyukjin Kwon commented on SPARK-17222: -- Here is *related* PR https://github.com/apache/spark/pull/13007 and *related* issue SPARK-15226 > Support multline csv records > > > Key: SPARK-17222 > URL: https://issues.apache.org/jira/browse/SPARK-17222 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski > > Below should be read as one record and currently it won't be since files and > records are split on new line. > {code} > "aaa","bb > b","ccc" > {code} > This shouldn't be default behaviour due to performance but should be > configurable if necessary -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
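The requested behaviour is what a CSV parser already does when it is fed the whole character stream rather than pre-split lines, which is exactly why splitting files on newlines breaks the record from the ticket. Python's stdlib csv module, used here purely for illustration:

```python
import csv
import io

# The record from the ticket: a quoted field containing a newline.
raw = '"aaa","bb\nb","ccc"\n'

# A parser consuming the whole stream (not pre-split lines) yields
# exactly one three-field record.
rows = list(csv.reader(io.StringIO(raw)))
assert rows == [["aaa", "bb\nb", "ccc"]]
```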
[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436150#comment-15436150 ] Hyukjin Kwon commented on SPARK-17227: -- Ah, SPARK-17222 is about multiple lines, but IMHO it might have been nicer to summarize those in a single JIRA, because I guess a single PR would fix all the listed JIRAs. > Allow configuring record delimiter in csv > - > > Key: SPARK-17227 > URL: https://issues.apache.org/jira/browse/SPARK-17227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Instead of hard coded "\n" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436147#comment-15436147 ] Hyukjin Kwon commented on SPARK-17227: -- We may have to open a JIRA to deal with multiple lines first. The root cause is the use of {{LineRecordReader}}, and for this reason JSON also forces the format to be a single-line document. I got rid of the weird inconsistent behavior in {{CSVParser}} in the current master anyway. So, Spark's CSV datasource does not support any multi-line records, if my understanding is correct. > Allow configuring record delimiter in csv > - > > Key: SPARK-17227 > URL: https://issues.apache.org/jira/browse/SPARK-17227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Instead of hard coded "\n" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436143#comment-15436143 ] Vincent commented on SPARK-17219: - For this scenario, we can add a new parameter to QuantileDiscretizer, such as the nullStrategy param Barry mentioned. Actually, R supports this kind of option with an "na.rm" flag that lets the user either remove NaN elements before computing quantiles or throw an error (the default). So, I think it's a nice thing to have in Spark too. > QuantileDiscretizer does strange things with NaN values > --- > > Key: SPARK-17219 > URL: https://issues.apache.org/jira/browse/SPARK-17219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2 >Reporter: Barry Becker > > How is the QuantileDiscretizer supposed to handle null values? > Actual nulls are not allowed, so I replace them with Double.NaN. > However, when you try to run the QuantileDiscretizer on a column that > contains NaNs, it will create (possibly more than one) NaN split(s) before > the final PositiveInfinity value. > I am using the attached titanic csv data and trying to bin the "age" column > using the QuantileDiscretizer with 10 bins specified. The age column has a lot > of null values. > These are the splits that I get: > {code} > -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity > {code} > Is that expected? It seems to imply that NaN is larger than any positive > number and less than infinity. > I'm not sure of the best way to handle nulls, but I think they need a bucket > all their own. My suggestion would be to include an initial NaN split value > that is always there, just like the sentinel Infinities are. If that were the > case, then the splits for the example above might look like this: > {code} > NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity > {code} > This does not seem great either because a bucket that is [NaN, -Inf] doesn't > make much sense. Not sure if the NaN bucket counts toward numBins or not. I > do think it should always be there though in case future data has null even > though the fit data did not. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
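The R-style "na.rm" option mentioned in the comment above can be sketched as follows. This is toy Python, not the actual QuantileDiscretizer; the function name, parameter, and the crude order-statistic splits are all invented for illustration.

```python
import math

def quantile_splits(values, num_bins, na_rm=False):
    """Toy split computation with an R-style na.rm option: drop NaNs
    up front when na_rm is set, otherwise fail fast instead of letting
    NaN leak into the split points."""
    if not na_rm and any(math.isnan(v) for v in values):
        raise ValueError("input contains NaN; set na_rm=True to drop them")
    clean = sorted(v for v in values if not math.isnan(v))
    # Evenly spaced order statistics stand in for approximate quantiles.
    return [clean[int(i * (len(clean) - 1) / num_bins)]
            for i in range(1, num_bins)]
```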
[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436140#comment-15436140 ] Hyukjin Kwon commented on SPARK-17227: -- Also, it would be great if the JIRA has an example and problem so that this can be tested and reproduced. > Allow configuring record delimiter in csv > - > > Key: SPARK-17227 > URL: https://issues.apache.org/jira/browse/SPARK-17227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Instead of hard coded "\n" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436139#comment-15436139 ] Hyukjin Kwon commented on SPARK-17227: -- If I remember this correctly, we are not using that {{rowSeparator}} anymore (although this is set into the parser) and I think that should be removed as CSV datasource uses {{LineRecordReader}} for each line. > Allow configuring record delimiter in csv > - > > Key: SPARK-17227 > URL: https://issues.apache.org/jira/browse/SPARK-17227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Instead of hard coded "\n" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
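A configurable record delimiter, in contrast to a reader that hard-codes "\n", amounts to something like the sketch below. Illustrative Python only; the function name is invented and this ignores delimiters inside quoted fields, which a real reader would have to handle.

```python
def split_records(text, record_delim="\n"):
    """Split a character stream into records on a configurable
    delimiter instead of a hard-coded newline."""
    records = text.split(record_delim)
    # A trailing delimiter produces one empty final record; drop it.
    if records and records[-1] == "":
        records.pop()
    return records
```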
[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436136#comment-15436136 ] Vincent commented on SPARK-17219: - For cases where only null and non-null buckets are needed, I guess we don't need to call QuantileDiscretizer to do that. > QuantileDiscretizer does strange things with NaN values > --- > > Key: SPARK-17219 > URL: https://issues.apache.org/jira/browse/SPARK-17219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2 >Reporter: Barry Becker > > How is the QuantileDiscretizer supposed to handle null values? > Actual nulls are not allowed, so I replace them with Double.NaN. > However, when you try to run the QuantileDiscretizer on a column that > contains NaNs, it will create (possibly more than one) NaN split(s) before > the final PositiveInfinity value. > I am using the attached titanic csv data and trying to bin the "age" column > using the QuantileDiscretizer with 10 bins specified. The age column has a lot > of null values. > These are the splits that I get: > {code} > -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity > {code} > Is that expected? It seems to imply that NaN is larger than any positive > number and less than infinity. > I'm not sure of the best way to handle nulls, but I think they need a bucket > all their own. My suggestion would be to include an initial NaN split value > that is always there, just like the sentinel Infinities are. If that were the > case, then the splits for the example above might look like this: > {code} > NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity > {code} > This does not seem great either because a bucket that is [NaN, -Inf] doesn't > make much sense. Not sure if the NaN bucket counts toward numBins or not. I > do think it should always be there though in case future data has null even > though the fit data did not. Thoughts? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16216) CSV data source does not write date and timestamp correctly
[ https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436124#comment-15436124 ] Apache Spark commented on SPARK-16216: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14799 > CSV data source does not write date and timestamp correctly > --- > > Key: SPARK-16216 > URL: https://issues.apache.org/jira/browse/SPARK-16216 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Blocker > Labels: releasenotes > > Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below: > {code} > ++ > |date| > ++ > |14406372| > |14144598| > |14540400| > ++ > {code} > It would be nicer if it wrote dates and timestamps as formatted strings, just > like the JSON data source. > Also, the CSV data source currently supports a {{dateFormat}} option to read dates > and timestamps in a custom format. It might be better if this option could be > applied when writing as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17110) Pyspark with locality ANY throw java.io.StreamCorruptedException
[ https://issues.apache.org/jira/browse/SPARK-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436120#comment-15436120 ] Gen TANG commented on SPARK-17110: -- [~radost...@gmail.com], It seems spark scala doesn't have this bug in version 2.0.0 > Pyspark with locality ANY throw java.io.StreamCorruptedException > > > Key: SPARK-17110 > URL: https://issues.apache.org/jira/browse/SPARK-17110 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: Cluster of 2 AWS r3.xlarge nodes launched via ec2 > scripts, Spark 2.0.0, hadoop: yarn, pyspark shell >Reporter: Tomer Kaftan >Priority: Critical > > In Pyspark 2.0.0, any task that accesses cached data non-locally throws a > StreamCorruptedException like the stacktrace below: > {noformat} > WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 26, 172.31.26.184): > java.io.StreamCorruptedException: invalid stream header: 12010A80 > at > java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:807) > at java.io.ObjectInputStream.(ObjectInputStream.java:302) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:146) > at > org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:524) > at > org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:522) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522) > at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661) > at 
org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The simplest way I have found to reproduce this is by running the following > code in the pyspark shell, on a cluster of 2 nodes set to use only one worker > core each: > {code} > x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() > x.count() > import time > def waitMap(x): > time.sleep(x) > return x > x.map(waitMap).count() > {code} > Or by running the following via spark-submit: > {code} > from pyspark import SparkContext > sc = SparkContext() > x = sc.parallelize([1, 1, 1, 1, 1, 1000, 1, 1, 1], numSlices=9).cache() > x.count() > import time > def waitMap(x): > time.sleep(x) > return x > x.map(waitMap).count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
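One way to read the {{invalid stream header: 12010A80}} message above: Java's ObjectInputStream requires every stream to begin with the serialization magic {{0xACED}} followed by version {{0x0005}}, so a different first word means the deserializer was handed bytes that were never Java-serialized (for example, a block written by a different serializer or read at the wrong offset). A minimal check of that header rule (plain Python):

```python
# Java's ObjectInputStream requires the four-byte header
# 0xACED (STREAM_MAGIC) followed by 0x0005 (STREAM_VERSION).
JAVA_STREAM_HEADER = bytes.fromhex("ACED0005")

def looks_java_serialized(buf):
    return buf[:4] == JAVA_STREAM_HEADER

# A valid Java-serialized stream passes; the header from the stack
# trace (12010A80) does not, consistent with the reader being handed
# bytes that were never Java-serialized.
assert looks_java_serialized(bytes.fromhex("ACED0005DEADBEEF"))
assert not looks_java_serialized(bytes.fromhex("12010A80"))
```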
[jira] [Commented] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436033#comment-15436033 ] Seth Hendrickson commented on SPARK-17163: -- BTW, I am happy to take care of merging the interfaces if we decide to. I did a bit of exploration today and it seems like it will be rather straightforward. > Decide on unified multinomial and binary logistic regression interfaces > --- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17156) Add multiclass logistic regression Scala Example
[ https://issues.apache.org/jira/browse/SPARK-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436004#comment-15436004 ] Miao Wang commented on SPARK-17156: --- Two quick comments: 1) add some comments like in the LogisticRegressionExample; 2) for the dataset, can you use a similar dataset to the one in the tests? > Add multiclass logistic regression Scala Example > > > Key: SPARK-17156 > URL: https://issues.apache.org/jira/browse/SPARK-17156 > Project: Spark > Issue Type: Task > Components: ML >Reporter: Miao Wang > > As [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to master, we should add a Scala example of using multiclass logistic > regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper
[ https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435996#comment-15435996 ] Miao Wang commented on SPARK-17157: --- Starting to work on it now. > Add multiclass logistic regression SparkR Wrapper > - > > Key: SPARK-17157 > URL: https://issues.apache.org/jira/browse/SPARK-17157 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Miao Wang > > [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to master. I am opening this JIRA for discussion of adding a SparkR wrapper > for multiclass logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17231: Assignee: (was: Apache Spark) > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements 2.jpg, > logging_perf_improvements.jpg, master 2.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17231: Assignee: Apache Spark > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Assignee: Apache Spark >Priority: Minor > Attachments: logging_perf_improvements 2.jpg, > logging_perf_improvements.jpg, master 2.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435958#comment-15435958 ] Apache Spark commented on SPARK-17231: -- User 'mallman' has created a pull request for this issue: https://github.com/apache/spark/pull/14798 > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements 2.jpg, > logging_perf_improvements.jpg, master 2.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17231: --- Description: While debugging the performance of a large GraphX connected components computation, I found several places in the {{network-common}} and {{network-shuffle}} code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (was: While debugging the performance of a large GraphX connected components computation, I found several places in the {{network-common}} and {{network-shuffle}} code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (PR to come.)) > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements 2.jpg, > logging_perf_improvements.jpg, master 2.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. 
[jira] [Comment Edited] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435767#comment-15435767 ] Michael Allman edited comment on SPARK-17231 at 8/24/16 11:29 PM: -- I've attached screenshots from four independent test runs on four different EC2 clusters with the same configuration. The only differences are that the master files are from test runs without the logging patches and the logging_perf_improvements files are from test runs with the logging patches. was (Author: michael): Note that in the attached screenshots, all stats are the same except task and gc time. > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements 2.jpg, > logging_perf_improvements.jpg, master 2.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. > (PR to come.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17231: --- Attachment: master 2.jpg logging_perf_improvements 2.jpg > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements 2.jpg, > logging_perf_improvements.jpg, master 2.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. > (PR to come.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17232) Expecting same behavior after loading a dataframe with dots in column name
Louis Salin created SPARK-17232: --- Summary: Expecting same behavior after loading a dataframe with dots in column name Key: SPARK-17232 URL: https://issues.apache.org/jira/browse/SPARK-17232 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Louis Salin
In Spark 2.0, the behavior of a dataframe changes after saving and reloading it when there are dots in the column names. In the example below, I was able to call the {{rdd}} function on a newly created dataframe. However, after saving and reloading it, an exception is thrown when calling the {{rdd}} function. From a spark-shell:
{{scala> val simpleDf = Seq((1, 2)).toDF("a.b", "a.c")}}
Res1: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
{{scala> simpleDf.rdd}}
Res2: org.apache.spark.rdd.RDD\[org.apache.spark.sql.Row\] = MapPartitionsRDD\[7\] at rdd at :29
{{scala> simpleDf.write.parquet("/user/lsalin/simpleDf")}}
{{scala> val readDf = spark.read.parquet("/user/lsalin/simpleDf")}}
Res4: org.apache.spark.sql.DataFrame = \[a.b: int, a.c: int\]
{{scala> readDf.rdd}}
{noformat}
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b, a.c];
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
  at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
  at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17231: --- Description: While debugging the performance of a large GraphX connected components computation, I found several places in the {{network-common}} and {{network-shuffle}} code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (PR to come.) was:While debugging the performance of a large GraphX connected components computation, I found several places in the {{network-common}} and {{network-shuffle}} code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. > (PR to come.) 
[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17231: --- Description: While debugging the performance of a large GraphX connected components computation, I found several places in the {{network-common}} and {{network-shuffle}} code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (was: While debugging the performance of a large GraphX connected components computation, I found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof.) > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. 
[jira] [Commented] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435773#comment-15435773 ] Dongjoon Hyun commented on SPARK-17123: --- Hi, [~joshrosen]. I'll make a PR for this.
> Performing set operations that combine string and date / timestamp columns
> may result in generated projection code which doesn't compile
>
> Key: SPARK-17123
> URL: https://issues.apache.org/jira/browse/SPARK-17123
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> The following example program causes SpecificSafeProjection code generation
> to produce Java code which doesn't compile:
> {code}
> import org.apache.spark.sql.types._
> spark.sql("set spark.sql.codegen.fallback=false")
> val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new java.sql.Date(0)))), StructType(StructField("value", DateType) :: Nil))
> val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF
> dateDF.union(longDF).collect()
> {code}
> This fails at runtime with the following error:
> {code}
> failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 28, Column 107: No applicable constructor/method found for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates are: "public static java.sql.Date org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
> /* 001 */ public java.lang.Object generate(Object[] references) {
> /* 002 */   return new SpecificSafeProjection(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */   private MutableRow mutableRow;
> /* 009 */   private Object[] values;
> /* 010 */   private org.apache.spark.sql.types.StructType schema;
> /* 011 */
> /* 012 */
> /* 013 */   public SpecificSafeProjection(Object[] references) {
> /* 014 */     this.references = references;
> /* 015 */     mutableRow = (MutableRow) references[references.length - 1];
> /* 016 */
> /* 017 */     this.schema = (org.apache.spark.sql.types.StructType) references[0];
> /* 018 */   }
> /* 019 */
> /* 020 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 021 */     InternalRow i = (InternalRow) _i;
> /* 022 */
> /* 023 */     values = new Object[1];
> /* 024 */
> /* 025 */     boolean isNull2 = i.isNullAt(0);
> /* 026 */     UTF8String value2 = isNull2 ? null : (i.getUTF8String(0));
> /* 027 */     boolean isNull1 = isNull2;
> /* 028 */     final java.sql.Date value1 = isNull1 ? null : org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2);
> /* 029 */     isNull1 = value1 == null;
> /* 030 */     if (isNull1) {
> /* 031 */       values[0] = null;
> /* 032 */     } else {
> /* 033 */       values[0] = value1;
> /* 034 */     }
> /* 035 */
> /* 036 */     final org.apache.spark.sql.Row value = new org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, schema);
> /* 037 */     if (false) {
> /* 038 */       mutableRow.setNullAt(0);
> /* 039 */     } else {
> /* 040 */
> /* 041 */       mutableRow.update(0, value);
> /* 042 */     }
> /* 043 */
> /* 044 */     return mutableRow;
> /* 045 */   }
> /* 046 */ }
> {code}
> Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the generated code tries to call it with a UTF8String while the method expects an int instead.
[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17231: --- Description: While debugging the performance of a large GraphX connected components computation, I found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (was: While debugging the performance of a large GraphX connected components computation, I found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (PR to follow.)) > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the `network-common` and > `network-shuffle` code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. 
[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17231: --- Description: While debugging the performance of a large GraphX connected components computation, I found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (PR to follow.) was: While debugging the performance of a large GraphX connected components computation, I found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (Before and after executor stats to follow in screenshots.) (PR to follow.) > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the `network-common` and > `network-shuffle` code bases where trace or debug log messages are > constructed even if the respective log level is disabled. 
Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. > (PR to follow.)
[jira] [Commented] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435767#comment-15435767 ] Michael Allman commented on SPARK-17231: Note that in the attached screenshots, all stats are the same except task and gc time. > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the `network-common` and > `network-shuffle` code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. > (Before and after executor stats to follow in screenshots.) > (PR to follow.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17231: --- Attachment: logging_perf_improvements.jpg master.jpg > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > Attachments: logging_perf_improvements.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the `network-common` and > `network-shuffle` code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. > (Before and after executor stats to follow in screenshots.) > (PR to follow.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17231: --- Description: While debugging the performance of a large GraphX connected components computation, I found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable reduction in task time, GC time and the ratio thereof. (Before and after executor stats to follow in screenshots.) (PR to follow.) was: While debugging the performance of a large GraphX connected components computation, I found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable performance improvement. (Before and after executor stats to follow in screenshots.) (PR to follow.) > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Priority: Minor > > While debugging the performance of a large GraphX connected components > computation, I found several places in the `network-common` and > `network-shuffle` code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. 
> (Before and after executor stats to follow in screenshots.) > (PR to follow.)
[jira] [Created] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
Michael Allman created SPARK-17231: -- Summary: Avoid building debug or trace log messages unless the respective log level is enabled Key: SPARK-17231 URL: https://issues.apache.org/jira/browse/SPARK-17231 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.0.0 Environment: Spark cluster with 8 r3.8xl EC2 worker instances Reporter: Michael Allman Priority: Minor While debugging the performance of a large GraphX connected components computation, I found several places in the `network-common` and `network-shuffle` code bases where trace or debug log messages are constructed even if the respective log level is disabled. Refactoring the respective code to avoid these constructions except where necessary led to a modest but measurable performance improvement. (Before and after executor stats to follow in screenshots.) (PR to follow.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17211) Broadcast join produces incorrect results
[ https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435745#comment-15435745 ] Himanish Kushary commented on SPARK-17211: -- I ran the following in a Databricks environment with Spark 2.0. It works fine.
{code:java}
import spark.implicits._

val a1 = Array((123,1),(234,2),(432,5))
val a2 = Array(("abc",1),("bcd",2),("dcb",5))
val df1 = sc.parallelize(a1).toDF("gid","id")
val df2 = sc.parallelize(a2).toDF("gname","id")

df1.join(df2,"id").show() // WORKS
+---+---+-----+
| id|gid|gname|
+---+---+-----+
|  5|432|  dcb|
|  2|234|  bcd|
|  1|123|  abc|
+---+---+-----+

df1.join(broadcast(df2),"id").show() // BROADCASTING - DOES NOT WORK on EMR
+---+---+-----+
| id|gid|gname|
+---+---+-----+
|  1|123| null|
|  2|234| null|
|  5|432| null|
+---+---+-----+

broadcast(df1).join(df2,"id").show() // BROADCASTING - DOES NOT WORK on EMR
{code}
> Broadcast join produces incorrect results > - > > Key: SPARK-17211 > URL: https://issues.apache.org/jira/browse/SPARK-17211 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Jarno Seppanen > > Broadcast join produces incorrect columns in the join result; see below for an > example. The same join without broadcast gives the correct columns. > Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
> keys = [
>     (5400, 0),
>     (5401, 1),
>     (5402, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # +------+-----+
> # |key_id|value|
> # +------+-----+
> # |  5400|    0|
> # |  5401|    1|
> # |  5402|    2|
> # +------+-----+
> data = [
>     (5402,1),
>     (5400,2),
>     (5401,3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # +------+---+
> # |key_id|foo|
> # +------+---+
> # |  5402|  1|
> # |  5400|  2|
> # |  5401|  3|
> # +------+---+
> ### INCORRECT ###
> data_df.join(func.broadcast(keys_df), 'key_id').show()
> # +------+---+-----+
> # |key_id|foo|value|
> # +------+---+-----+
> # |  5402|  1| 5402|
> # |  5400|  2| 5400|
> # |  5401|  3| 5401|
> # +------+---+-----+
> ### CORRECT ###
> data_df.join(keys_df, 'key_id').show()
> # +------+---+-----+
> # |key_id|foo|value|
> # +------+---+-----+
> # |  5400|  2|    0|
> # |  5401|  3|    1|
> # |  5402|  1|    2|
> # +------+---+-----+
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
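For reference, the result both variants must agree on follows ordinary hash-join semantics: build a lookup table from the small (broadcast) side, then probe it with each row of the other side. A plain-Python sketch of those semantics (illustrative only, not Spark's implementation):

```python
def broadcast_hash_join(left, right, key):
    # Build a lookup table from the small (broadcast) side, then probe it
    # with each row of the large side -- the semantics a broadcast join
    # must reproduce exactly.
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            merged = dict(match)
            merged.update(row)
            joined.append(merged)
    return joined

# The data from the report above.
data = [{"key_id": 5402, "foo": 1}, {"key_id": 5400, "foo": 2}, {"key_id": 5401, "foo": 3}]
keys = [{"key_id": 5400, "value": 0}, {"key_id": 5401, "value": 1}, {"key_id": 5402, "value": 2}]
result = broadcast_hash_join(data, keys, "key_id")
# Each output row carries the matched 'value' column, never a copy of the key.
```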
[jira] [Assigned] (SPARK-17230) Writing decimal to csv will result in an empty string if the decimal exceeds (20, 18)
[ https://issues.apache.org/jira/browse/SPARK-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17230: Assignee: Davies Liu (was: Apache Spark) > Writing decimal to csv will result in an empty string if the decimal exceeds (20, 18) > --- > > Key: SPARK-17230 > URL: https://issues.apache.org/jira/browse/SPARK-17230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > >
> {code}
> // file content
> spark.read.csv("/mnt/djiang/test-case.csv").show
> // read in as string and create temp view
> spark.read.csv("/mnt/djiang/test-case.csv").createOrReplaceTempView("test")
> // confirm schema
> spark.table("test").printSchema
> // apply decimal calculation, confirm the result is correct
> spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) from test").show(false)
> // run the same query, and write out as csv
> spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) from test").write.csv("/mnt/djiang/test-case-result")
> // show the content of the result file; for numbers exceeding decimal(20, 18) the csv writer emits nothing, failing silently
> spark.read.csv("/mnt/djiang/test-case-result").show
>
> +------+
> |   _c0|
> +------+
> |     1|
> |    10|
> |   100|
> |  1000|
> | 10000|
> |100000|
> +------+
>
> root
>  |-- _c0: string (nullable = true)
>
> +------+-------------------------------------------------------------------------------------------------------------------------------------------------+
> |_c0   |(CAST(CAST(CAST(CAST(_c0 AS DECIMAL(20,0)) AS BIGINT) AS DECIMAL(20,0)) AS DECIMAL(38,18)) * CAST(CAST(1.0 AS DECIMAL(38,18)) AS DECIMAL(38,18)))|
> +------+-------------------------------------------------------------------------------------------------------------------------------------------------+
> |1     |1.000000000000000000                                                                                                                             |
> |10    |10.000000000000000000                                                                                                                            |
> |100   |100.000000000000000000                                                                                                                           |
> |1000  |1000.000000000000000000                                                                                                                          |
> |10000 |10000.000000000000000000                                                                                                                         |
> |100000|100000.000000000000000000                                                                                                                        |
> +------+-------------------------------------------------------------------------------------------------------------------------------------------------+
>
> +------+--------------------+
> |   _c0|                 _c1|
> +------+--------------------+
> |     1|1.000000000000000000|
> |    10|10.00000000000000...|
> |   100|                    |
> |  1000|                    |
> | 10000|                    |
> |100000|                    |
> +------+--------------------+
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17230) Writing decimal to csv will result in an empty string if the decimal exceeds (20, 18)
[ https://issues.apache.org/jira/browse/SPARK-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17230: Assignee: Apache Spark (was: Davies Liu) > Writing decimal to csv will result empty string if the decimal exceeds (20, > 18) > --- > > Key: SPARK-17230 > URL: https://issues.apache.org/jira/browse/SPARK-17230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Davies Liu >Assignee: Apache Spark > > {code} > // file content > spark.read.csv("/mnt/djiang/test-case.csv").show > // read in as string and create temp view > spark.read.csv("/mnt/djiang/test-case.csv").createOrReplaceTempView("test") > // confirm schema > spark.table("test").printSchema > // apply decimal calculation, confirm the result is correct > spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) > from test").show(false) > // run the same query, and write out as csv > spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) > from test").write.csv("/mnt/djiang/test-case-result") > // show the content of the result file, particularly, for number exceeded > decimal(20, 18), the csv is not writing anything or failing silently > spark.read.csv("/mnt/djiang/test-case-result").show > +--+ > | _c0| > +--+ > | 1| > | 10| > | 100| > | 1000| > | 1| > |10| > +--+ > root > |-- _c0: string (nullable = true) > +--+-+ > > |_c0 |(CAST(CAST(CAST(CAST(_c0 AS DECIMAL(20,0)) AS BIGINT) AS DECIMAL(20,0)) > AS DECIMAL(38,18)) * CAST(CAST(1.0 AS DECIMAL(38,18)) AS DECIMAL(38,18)))| > +--+-+ > > |1 |1.00 | > |10 |10.00 | > |100 |100.00 | > |1000 |1000.00 | > |1 |1.00 | > |10|10.00 | > +--+-+ > +--++ > | _c0| _c1| > +--++ > | 1|1.00| > | 10|10.00...| > | 100| | > | 1000| | > | 1| | > |10| | > +--++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17230) Writing decimal to csv will result in an empty string if the decimal exceeds (20, 18)
[ https://issues.apache.org/jira/browse/SPARK-17230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435725#comment-15435725 ] Apache Spark commented on SPARK-17230: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/14797 > Writing decimal to csv will result empty string if the decimal exceeds (20, > 18) > --- > > Key: SPARK-17230 > URL: https://issues.apache.org/jira/browse/SPARK-17230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > > {code} > // file content > spark.read.csv("/mnt/djiang/test-case.csv").show > // read in as string and create temp view > spark.read.csv("/mnt/djiang/test-case.csv").createOrReplaceTempView("test") > // confirm schema > spark.table("test").printSchema > // apply decimal calculation, confirm the result is correct > spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) > from test").show(false) > // run the same query, and write out as csv > spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) > from test").write.csv("/mnt/djiang/test-case-result") > // show the content of the result file, particularly, for number exceeded > decimal(20, 18), the csv is not writing anything or failing silently > spark.read.csv("/mnt/djiang/test-case-result").show > +--+ > | _c0| > +--+ > | 1| > | 10| > | 100| > | 1000| > | 1| > |10| > +--+ > root > |-- _c0: string (nullable = true) > +--+-+ > > |_c0 |(CAST(CAST(CAST(CAST(_c0 AS DECIMAL(20,0)) AS BIGINT) AS DECIMAL(20,0)) > AS DECIMAL(38,18)) * CAST(CAST(1.0 AS DECIMAL(38,18)) AS DECIMAL(38,18)))| > +--+-+ > > |1 |1.00 | > |10 |10.00 | > |100 |100.00 | > |1000 |1000.00 | > |1 |1.00 | > |10|10.00 | > +--+-+ > +--++ > | _c0| _c1| > +--++ > | 1|1.00| > | 10|10.00...| > | 100| | > | 1000| | > | 1| | > |10| | > +--++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
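The expected behaviour can be sketched with Python's decimal module: a value wider than decimal(20, 18) still has a perfectly ordinary string form, and that plain form is what a CSV writer should emit rather than an empty field (a hedged sketch, not Spark's writer code):

```python
from decimal import Decimal, getcontext

getcontext().prec = 38  # headroom comparable to decimal(38, 18)

def csv_field(d):
    # A CSV writer should emit the plain (non-scientific) string form of
    # any decimal, regardless of precision -- never an empty field.
    return format(d, "f")

# 1.000000000000000000 with 18 fractional digits, as in cast('1.0' as decimal(38, 18)).
one = Decimal("1.0").quantize(Decimal(1).scaleb(-18))
# 100 * 1.000000000000000000 has 21 significant digits, so it no longer
# fits decimal(20, 18) -- the case the buggy writer drops silently.
wide = Decimal(100) * one
# csv_field(wide) -> "100.000000000000000000"
```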
[jira] [Assigned] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17120: Assignee: Apache Spark > Analyzer incorrectly optimizes plan to empty LocalRelation > -- > > Key: SPARK-17120 > URL: https://issues.apache.org/jira/browse/SPARK-17120 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > > Consider the following query: > {code} > sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > println(sql(""" > SELECT > * > FROM ( > SELECT > COALESCE(t2.int_col_1, t1.int_col_6) AS int_col > FROM table_3 t1 > LEFT JOIN table_4 t2 ON false > ) t where (t.int_col) is not null > """).collect().toSeq) > {code} > In the innermost query, the LEFT JOIN's condition is {{false}} but > nevertheless the number of rows produced should equal the number of rows in > {{table_3}} (which is non-empty). Since no values are {{null}}, the outer > {{where}} should retain all rows, so the overall result of this query should > contain a single row with the value '97'. > Instead, the current Spark master (as of > 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking > at {{explain}}, it appears that the logical plan is optimizing to > {{LocalRelation }}, so Spark doesn't even run the query. My suspicion > is that there's a bug in constraint propagation or filter pushdown. > This issue doesn't seem to affect Spark 2.0, so I think it's a regression in > master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17120: Assignee: (was: Apache Spark) > Analyzer incorrectly optimizes plan to empty LocalRelation > -- > > Key: SPARK-17120 > URL: https://issues.apache.org/jira/browse/SPARK-17120 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Consider the following query: > {code} > sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > println(sql(""" > SELECT > * > FROM ( > SELECT > COALESCE(t2.int_col_1, t1.int_col_6) AS int_col > FROM table_3 t1 > LEFT JOIN table_4 t2 ON false > ) t where (t.int_col) is not null > """).collect().toSeq) > {code} > In the innermost query, the LEFT JOIN's condition is {{false}} but > nevertheless the number of rows produced should equal the number of rows in > {{table_3}} (which is non-empty). Since no values are {{null}}, the outer > {{where}} should retain all rows, so the overall result of this query should > contain a single row with the value '97'. > Instead, the current Spark master (as of > 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking > at {{explain}}, it appears that the logical plan is optimizing to > {{LocalRelation }}, so Spark doesn't even run the query. My suspicion > is that there's a bug in constraint propagation or filter pushdown. > This issue doesn't seem to affect Spark 2.0, so I think it's a regression in > master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
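Why a single row with 97 is expected can be checked against a toy model of LEFT JOIN and COALESCE semantics (plain Python, illustrative only, not Spark code):

```python
def left_join(left_rows, right_rows, on):
    # LEFT JOIN keeps every left row; rows with no match get None (NULL)
    # for the right side. With an always-false condition, every left row
    # is unmatched -- none may be dropped.
    out = []
    for l in left_rows:
        matches = [r for r in right_rows if on(l, r)]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

def coalesce(*vals):
    # SQL COALESCE: first non-NULL argument.
    return next((v for v in vals if v is not None), None)

table_3, table_4 = [97], [0]
rows = left_join(table_3, table_4, on=lambda l, r: False)
int_col = [coalesce(r, l) for l, r in rows]
result = [v for v in int_col if v is not None]  # the outer IS NOT NULL filter
# result == [97]: one surviving row, not an empty relation.
```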
[jira] [Commented] (SPARK-17226) Allow defining multiple date formats per column in csv
[ https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435709#comment-15435709 ] Robert Kruszewski commented on SPARK-17226: --- Anything in particular you have in mind? I should have defined components and versions beforehand, that's my mistake. It's useful to have an issue for tracking in case anyone else comes looking for it. > Allow defining multiple date formats per column in csv > -- > > Key: SPARK-17226 > URL: https://issues.apache.org/jira/browse/SPARK-17226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Useful to have fallbacks in case of messy input and different columns can > have different formats. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L106 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
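The requested fallback behaviour — try each configured date format per column until one parses — can be sketched as follows (illustrative Python; the option names Spark would expose are not decided in this issue):

```python
from datetime import datetime

def parse_with_fallbacks(value, formats):
    # Try each configured format in order; return the first successful parse.
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None  # unparseable -> NULL (a strict mode could raise instead)

# Messy input: the first format fails, the second succeeds.
d = parse_with_fallbacks("24/08/2016", ["%Y-%m-%d", "%d/%m/%Y"])
```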
[jira] [Assigned] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17099: Assignee: (was: Apache Spark) > Incorrect result when HAVING clause is added to group by query > -- > > Key: SPARK-17099 > URL: https://issues.apache.org/jira/browse/SPARK-17099 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Random query generation uncovered the following query which returns incorrect > results when run on Spark SQL. This wasn't the original query uncovered by > the generator, since I performed a bit of minimization to try to make it more > understandable. > With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
>     (-769, -244),
>     (-800, -409),
>     (940, 86),
>     (-507, 304),
>     (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>   ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>   COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four rows. However, if I omit the {{HAVING}} clause I see that the group's rows are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +-------------------------------------+--------------------------------------+
> | sum(coalesce(int_col_5, int_col_2)) | (coalesce(int_col_5, int_col_2) * 2) |
> +-------------------------------------+--------------------------------------+
> | -507                                | -1014                                |
> | 940                                 | 1880                                 |
> | -769                                | -1538                                |
> | -367                                | -734                                 |
> | -800                                | -1600                                |
> +-------------------------------------+--------------------------------------+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm > opening this bug to get help in triaging further. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435714#comment-15435714 ] Apache Spark commented on SPARK-17120: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14661 > Analyzer incorrectly optimizes plan to empty LocalRelation > -- > > Key: SPARK-17120 > URL: https://issues.apache.org/jira/browse/SPARK-17120 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Consider the following query: > {code} > sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > println(sql(""" > SELECT > * > FROM ( > SELECT > COALESCE(t2.int_col_1, t1.int_col_6) AS int_col > FROM table_3 t1 > LEFT JOIN table_4 t2 ON false > ) t where (t.int_col) is not null > """).collect().toSeq) > {code} > In the innermost query, the LEFT JOIN's condition is {{false}} but > nevertheless the number of rows produced should equal the number of rows in > {{table_3}} (which is non-empty). Since no values are {{null}}, the outer > {{where}} should retain all rows, so the overall result of this query should > contain a single row with the value '97'. > Instead, the current Spark master (as of > 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking > at {{explain}}, it appears that the logical plan is optimizing to > {{LocalRelation }}, so Spark doesn't even run the query. My suspicion > is that there's a bug in constraint propagation or filter pushdown. > This issue doesn't seem to affect Spark 2.0, so I think it's a regression in > master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435711#comment-15435711 ] Apache Spark commented on SPARK-17099: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14661 > Incorrect result when HAVING clause is added to group by query > -- > > Key: SPARK-17099 > URL: https://issues.apache.org/jira/browse/SPARK-17099 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Random query generation uncovered the following query which returns incorrect > results when run on Spark SQL. This wasn't the original query uncovered by > the generator, since I performed a bit of minimization to try to make it more > understandable. > With the following tables: > {code} > val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5") > val t2 = sc.parallelize( > Seq( > (-769, -244), > (-800, -409), > (940, 86), > (-507, 304), > (-367, 158)) > ).toDF("int_col_2", "int_col_5") > t1.registerTempTable("t1") > t2.registerTempTable("t2") > {code} > Run > {code} > SELECT > (SUM(COALESCE(t1.int_col_5, t2.int_col_2))), > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2) > FROM t1 > RIGHT JOIN t2 > ON (t2.int_col_2) = (t1.int_col_5) > GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)), > COALESCE(t1.int_col_5, t2.int_col_2) > HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, > t2.int_col_2)) * 2) > {code} > In Spark SQL, this returns an empty result set, whereas Postgres returns four > rows. 
However, if I omit the {{HAVING}} clause I see that the group's rows > are being incorrectly filtered by the {{HAVING}} clause: > {code} > +--+---+--+ > | sum(coalesce(int_col_5, int_col_2)) | (coalesce(int_col_5, int_col_2) * 2) > | > +--+---+--+ > | -507 | -1014 > | > | 940 | 1880 > | > | -769 | -1538 > | > | -367 | -734 > | > | -800 | -1600 > | > +--+---+--+ > {code} > Based on this, the output after adding the {{HAVING}} should contain four > rows, not zero. > I'm not sure how to further shrink this in a straightforward way, so I'm > opening this bug to get help in triaging further. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17099: Assignee: Apache Spark > Incorrect result when HAVING clause is added to group by query > -- > > Key: SPARK-17099 > URL: https://issues.apache.org/jira/browse/SPARK-17099 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Blocker > > Random query generation uncovered the following query which returns incorrect > results when run on Spark SQL. This wasn't the original query uncovered by > the generator, since I performed a bit of minimization to try to make it more > understandable. > With the following tables: > {code} > val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5") > val t2 = sc.parallelize( > Seq( > (-769, -244), > (-800, -409), > (940, 86), > (-507, 304), > (-367, 158)) > ).toDF("int_col_2", "int_col_5") > t1.registerTempTable("t1") > t2.registerTempTable("t2") > {code} > Run > {code} > SELECT > (SUM(COALESCE(t1.int_col_5, t2.int_col_2))), > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2) > FROM t1 > RIGHT JOIN t2 > ON (t2.int_col_2) = (t1.int_col_5) > GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)), > COALESCE(t1.int_col_5, t2.int_col_2) > HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, > t2.int_col_2)) * 2) > {code} > In Spark SQL, this returns an empty result set, whereas Postgres returns four > rows. However, if I omit the {{HAVING}} clause I see that the group's rows > are being incorrectly filtered by the {{HAVING}} clause: > {code} > +--+---+--+ > | sum(coalesce(int_col_5, int_col_2)) | (coalesce(int_col_5, int_col_2) * 2) > | > +--+---+--+ > | -507 | -1014 > | > | 940 | 1880 > | > | -769 | -1538 > | > | -367 | -734 > | > | -800 | -1600 > | > +--+---+--+ > {code} > Based on this, the output after adding the {{HAVING}} should contain four > rows, not zero. 
> I'm not sure how to further shrink this in a straightforward way, so I'm > opening this bug to get help in triaging further. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
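Why four rows are expected: each GROUP BY key in this query identifies a single row, so the aggregate equals the row's value v, and HAVING SUM(v) > v * 2 keeps exactly the negative values (940 fails since 940 > 1880 is false). A toy model of GROUP BY plus HAVING semantics (plain Python, illustrative only):

```python
from collections import defaultdict

def group_by_having(rows, key, agg, having):
    # HAVING filters whole groups by their aggregate value -- it is applied
    # after aggregation, never row-by-row against other groups' aggregates.
    groups = defaultdict(list)
    for r in rows:
        groups[key(r)].append(r)
    return [agg(g) for g in groups.values() if having(agg(g))]

# The coalesce values from the report's no-HAVING output.
vals = [-507, 940, -769, -367, -800]
kept = group_by_having(
    [{"v": v} for v in vals],
    key=lambda r: r["v"],
    agg=lambda g: sum(r["v"] for r in g),
    having=lambda s: s > s * 2,  # SUM(v) > v * 2, with single-row groups
)
# Four groups survive (all but 940), matching the Postgres result.
```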
[jira] [Updated] (SPARK-17226) Allow defining multiple date formats per column in csv
[ https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17226: -- Description: Useful to have fallbacks in case of messy input and different columns can have different formats. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L106 was:Useful to have fallbacks in case of messy input and different columns can have different formats. > Allow defining multiple date formats per column in csv > -- > > Key: SPARK-17226 > URL: https://issues.apache.org/jira/browse/SPARK-17226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Useful to have fallbacks in case of messy input and different columns can > have different formats. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L106 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17225) Support multiple null values in csv files
[ https://issues.apache.org/jira/browse/SPARK-17225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17225: -- Component/s: SQL > Support multiple null values in csv files > - > > Key: SPARK-17225 > URL: https://issues.apache.org/jira/browse/SPARK-17225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Since we're dealing with strings it's useful to have multiple different > representations of null values as data might not be fully normalized -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
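The requested behaviour — recognizing several string representations of null rather than a single configured one — can be sketched as follows (illustrative Python; the token set and option name are assumptions, not Spark's API):

```python
# A configurable set of strings to treat as null (hypothetical defaults).
DEFAULT_NULL_TOKENS = {"", "NA", "N/A", "null", "NULL", "\\N"}

def normalize_field(field, null_tokens=DEFAULT_NULL_TOKENS):
    # Map any configured null representation to None; pass other strings through.
    return None if field in null_tokens else field

row = ["alice", "NA", "\\N", "42"]
parsed = [normalize_field(f) for f in row]
# parsed == ["alice", None, None, "42"]
```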
[jira] [Updated] (SPARK-17224) Support skipping multiple header rows in csv
[ https://issues.apache.org/jira/browse/SPARK-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17224: -- Component/s: SQL > Support skipping multiple header rows in csv > > > Key: SPARK-17224 > URL: https://issues.apache.org/jira/browse/SPARK-17224 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Headers can be multiline and sometimes you want to skip multiple rows because > of the format you've been given -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
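The requested behaviour — discarding a multi-line header before parsing data — can be sketched with a hypothetical skip-rows option (plain Python, not Spark's CSV reader):

```python
import csv
import io

def read_skipping_headers(text, skip_rows):
    # Discard the first skip_rows physical records (a multi-line header),
    # then parse the remainder as data.
    reader = csv.reader(io.StringIO(text))
    for _ in range(skip_rows):
        next(reader, None)
    return list(reader)

# Two header lines followed by two data rows.
text = "report,2016\ncol_a,col_b\n1,2\n3,4\n"
rows = read_skipping_headers(text, skip_rows=2)
# rows == [["1", "2"], ["3", "4"]]
```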
[jira] [Updated] (SPARK-17225) Support multiple null values in csv files
[ https://issues.apache.org/jira/browse/SPARK-17225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17225: -- Affects Version/s: 2.0.0 > Support multiple null values in csv files > - > > Key: SPARK-17225 > URL: https://issues.apache.org/jira/browse/SPARK-17225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Since we're dealing with strings it's useful to have multiple different > representations of null values as data might not be fully normalized -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17226) Allow defining multiple date formats per column in csv
[ https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17226: -- Component/s: SQL > Allow defining multiple date formats per column in csv > -- > > Key: SPARK-17226 > URL: https://issues.apache.org/jira/browse/SPARK-17226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Useful to have fallbacks in case of messy input and different columns can > have different formats. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17224) Support skipping multiple header rows in csv
[ https://issues.apache.org/jira/browse/SPARK-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17224: -- Affects Version/s: 2.0.0 > Support skipping multiple header rows in csv > > > Key: SPARK-17224 > URL: https://issues.apache.org/jira/browse/SPARK-17224 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Headers can be multiline and sometimes you want to skip multiple rows because > of the format you've been given -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17226) Allow defining multiple date formats per column in csv
[ https://issues.apache.org/jira/browse/SPARK-17226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17226: -- Affects Version/s: 2.0.0 > Allow defining multiple date formats per column in csv > -- > > Key: SPARK-17226 > URL: https://issues.apache.org/jira/browse/SPARK-17226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Useful to have fallbacks in case of messy input and different columns can > have different formats. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17227: -- Affects Version/s: 2.0.0 > Allow configuring record delimiter in csv > - > > Key: SPARK-17227 > URL: https://issues.apache.org/jira/browse/SPARK-17227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Instead of hard coded "\n" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17227: -- Component/s: SQL > Allow configuring record delimiter in csv > - > > Key: SPARK-17227 > URL: https://issues.apache.org/jira/browse/SPARK-17227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski >Priority: Minor > > Instead of hard coded "\n" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
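The requested behaviour — splitting input into records on a configurable delimiter rather than a hard-coded "\n" — can be sketched as follows (plain Python, illustrative only; the option name Spark would use is not decided here):

```python
def split_records(data, record_delimiter="\n"):
    # Split on a configurable record delimiter instead of a hard-coded "\n";
    # a trailing delimiter should not produce a phantom empty record.
    records = data.split(record_delimiter)
    if records and records[-1] == "":
        records.pop()
    return records

rows = split_records("a,1|b,2|c,3|", record_delimiter="|")
# rows == ["a,1", "b,2", "c,3"]
```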
[jira] [Updated] (SPARK-17222) Support multiline csv records
[ https://issues.apache.org/jira/browse/SPARK-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17222: -- Affects Version/s: 2.0.0 > Support multiline csv records > > > Key: SPARK-17222 > URL: https://issues.apache.org/jira/browse/SPARK-17222 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski > > Below should be read as one record, but currently it won't be, since files and > records are split on newlines.
> {code}
> "aaa","bb
> b","ccc"
> {code}
> This shouldn't be the default behaviour, for performance reasons, but should be > configurable when necessary -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17222) Support multiline csv records
[ https://issues.apache.org/jira/browse/SPARK-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kruszewski updated SPARK-17222: -- Component/s: SQL > Support multiline csv records > > > Key: SPARK-17222 > URL: https://issues.apache.org/jira/browse/SPARK-17222 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Robert Kruszewski > > Below should be read as one record, but currently it won't be, since files and > records are split on newlines.
> {code}
> "aaa","bb
> b","ccc"
> {code}
> This shouldn't be the default behaviour, for performance reasons, but should be > configurable when necessary -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
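For comparison, RFC 4180-style parsers such as Python's csv module already treat a newline inside a quoted field as part of the record; the request above is for Spark's CSV source to support the same behaviour as an opt-in:

```python
import csv
import io

# The record from the issue: a quoted field containing a newline.
text = '"aaa","bb\nb","ccc"\n'
rows = list(csv.reader(io.StringIO(text)))
# A conforming parser yields a single three-field record,
# with the newline preserved inside the middle field.
```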
[jira] [Updated] (SPARK-16334) SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16334: -- Labels: (was: sql) Fix Version/s: (was: 2.0.1) (was: 2.1.0) Component/s: SQL Summary: SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException (was: [SQL] SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException) > SQL query on parquet table java.lang.ArrayIndexOutOfBoundsException > --- > > Key: SPARK-16334 > URL: https://issues.apache.org/jira/browse/SPARK-16334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Assignee: Sameer Agarwal >Priority: Critical > > Query: > {code} > select * from blabla where user_id = 415706251 > {code} > Error: > {code} > 16/06/30 14:07:27 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 0.0 > (TID 3, hadoop6): java.lang.ArrayIndexOutOfBoundsException: 6934 > at > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.decodeToBinary(PlainValuesDictionary.java:119) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.decodeDictionaryIds(VectorizedColumnReader.java:273) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:170) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Works on 1.6.1
[jira] [Assigned] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads
[ https://issues.apache.org/jira/browse/SPARK-17229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17229: Assignee: Apache Spark (was: Josh Rosen) > Postgres JDBC dialect should not widen float and short types during reads > - > > Key: SPARK-17229 > URL: https://issues.apache.org/jira/browse/SPARK-17229 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Minor > > When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's > Postgres dialect widens these types to Decimal and Integer rather than using > the narrower Float and Short types.
[jira] [Commented] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads
[ https://issues.apache.org/jira/browse/SPARK-17229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435665#comment-15435665 ] Apache Spark commented on SPARK-17229: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/14796 > Postgres JDBC dialect should not widen float and short types during reads > - > > Key: SPARK-17229 > URL: https://issues.apache.org/jira/browse/SPARK-17229 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's > Postgres dialect widens these types to Decimal and Integer rather than using > the narrower Float and Short types.
[jira] [Assigned] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads
[ https://issues.apache.org/jira/browse/SPARK-17229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17229: Assignee: Josh Rosen (was: Apache Spark) > Postgres JDBC dialect should not widen float and short types during reads > - > > Key: SPARK-17229 > URL: https://issues.apache.org/jira/browse/SPARK-17229 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's > Postgres dialect widens these types to Decimal and Integer rather than using > the narrower Float and Short types.
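The proposed behaviour amounts to mapping each Postgres type onto the Spark type of the same width instead of a wider one. A minimal lookup-table sketch, where the dict and names are illustrative assumptions rather than Spark internals:

```python
# Hypothetical lookup illustrating the narrower mapping the fix proposes;
# names are for illustration only, not actual Spark JdbcDialect code.
PG_TO_CATALYST = {
    "float4":   "FloatType",    # previously widened to DecimalType
    "smallint": "ShortType",    # previously widened to IntegerType
    "float8":   "DoubleType",   # already mapped at the correct width
}

print(PG_TO_CATALYST["float4"])  # FloatType
```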
[jira] [Created] (SPARK-17230) Writing decimal to csv will result in an empty string if the decimal exceeds (20, 18)
Davies Liu created SPARK-17230: -- Summary: Writing decimal to csv will result in an empty string if the decimal exceeds (20, 18) Key: SPARK-17230 URL: https://issues.apache.org/jira/browse/SPARK-17230 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 1.6.2 Reporter: Davies Liu Assignee: Davies Liu
{code}
// file content
spark.read.csv("/mnt/djiang/test-case.csv").show
// read in as string and create temp view
spark.read.csv("/mnt/djiang/test-case.csv").createOrReplaceTempView("test")
// confirm schema
spark.table("test").printSchema
// apply decimal calculation, confirm the result is correct
spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) from test").show(false)
// run the same query, and write out as csv
spark.sql("select _c0, cast(_c0 as long) * cast('1.0' as decimal(38, 18)) from test").write.csv("/mnt/djiang/test-case-result")
// show the content of the result file; in particular, for numbers exceeding decimal(20, 18), the csv writer emits nothing, failing silently
spark.read.csv("/mnt/djiang/test-case-result").show
+--+
| _c0|
+--+
| 1|
| 10|
| 100|
| 1000|
| 1|
|10|
+--+
root
 |-- _c0: string (nullable = true)
+--+-+
|_c0 |(CAST(CAST(CAST(CAST(_c0 AS DECIMAL(20,0)) AS BIGINT) AS DECIMAL(20,0)) AS DECIMAL(38,18)) * CAST(CAST(1.0 AS DECIMAL(38,18)) AS DECIMAL(38,18)))|
+--+-+
|1 |1.00 |
|10 |10.00 |
|100 |100.00 |
|1000 |1000.00 |
|1 |1.00 |
|10|10.00 |
+--+-+
+--++
| _c0| _c1|
+--++
| 1|1.00|
| 10|10.00...|
| 100| |
| 1000| |
| 1| |
|10| |
+--++
{code}
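The (20, 18) boundary can be reproduced with Python's stdlib decimal module (an analogy to the limit, not Spark's code path): a value with 18 fractional digits and more than 2 integer digits needs a coefficient longer than 20 digits, so quantizing under precision 20 signals InvalidOperation, which is roughly the condition the CSV writer appears to swallow silently.

```python
from decimal import Decimal, Context, InvalidOperation

ctx = Context(prec=20)   # total precision 20, mirroring decimal(20, 18)
exp = Decimal("1E-18")   # quantize target: 18 fractional digits

print(Decimal("10").quantize(exp, context=ctx))   # 10.000000000000000000 (2 + 18 = 20 digits, fits)
try:
    Decimal("100").quantize(exp, context=ctx)     # 3 + 18 = 21 digits, exceeds prec
except InvalidOperation:
    print("value exceeds (20, 18)")
```

This matches the table above: 1 and 10 survive the round-trip, while 100 and larger come back empty.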
[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435651#comment-15435651 ] Barry Becker commented on SPARK-17219: -- If the decision is to have an additional null/NaN bucket, then I agree that other choices aren't needed. I agree that the null/NaN bucket can be separate from maxBins (i.e. request 10, but get 11). A couple of other things to consider: - I think there should always be a null/NaN bucket present for the same reason that the first and last bins are -Inf and +Inf respectively. Just because there were no nulls in the training/fitting data does not mean that they will not come through later and need to be placed somewhere. - Currently validation fails if there are fewer than 3 splits specified for a Bucketizer. I actually think that 2 splits should be the minimum - even though that means only 1 bucket! The reason is that some algorithms (like Naive Bayes) may choose to bin features (using MDLP discretization for example) into just 2 buckets - null and non-null. If we now have a null bucket always present, we may just want a single [-Inf, Inf] bucket for non-nulls - as strange as that sounds. > QuantileDiscretizer does strange things with NaN values > --- > > Key: SPARK-17219 > URL: https://issues.apache.org/jira/browse/SPARK-17219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2 >Reporter: Barry Becker > > How is the QuantileDiscretizer supposed to handle null values? > Actual nulls are not allowed, so I replace them with Double.NaN. > However, when you try to run the QuantileDiscretizer on a column that > contains NaNs, it will create (possibly more than one) NaN split(s) before > the final PositiveInfinity value. > I am using the attached titanic csv data and trying to bin the "age" column > using the QuantileDiscretizer with 10 bins specified. The age column has a lot > of null values.
> These are the splits that I get: > {code} > -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity > {code} > Is that expected? It seems to imply that NaN is larger than any positive > number and less than infinity. > I'm not sure of the best way to handle nulls, but I think they need a bucket > all their own. My suggestion would be to include an initial NaN split value > that is always there, just like the sentinel Infinities are. If that were the > case, then the splits for the example above might look like this: > {code} > NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity > {code} > This does not seem great either because a bucket that is [NaN, -Inf] doesn't > make much sense. Not sure if the NaN bucket counts toward numBins or not. I > do think it should always be there though in case future data has null even > though the fit data did not. Thoughts?
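The reported splits are consistent with IEEE 754 comparison semantics, which any sort-based quantile computation inherits: every ordered comparison against NaN is false, so sorting tends to leave NaN past the largest finite values. A stdlib Python illustration (not Spark code):

```python
nan = float('nan')

# Every ordered comparison involving NaN is false, including equality:
print(nan < 1e300, nan > 1e300, nan == nan)   # False False False

# max() keeps its first argument unless a later one compares greater,
# and nothing compares greater than NaN, so NaN even "beats" infinity here:
print(max(nan, float('inf')))                 # nan
```

This is why NaN split points appear "larger than any positive number" in the sorted splits rather than being rejected outright.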
[jira] [Created] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads
Josh Rosen created SPARK-17229: -- Summary: Postgres JDBC dialect should not widen float and short types during reads Key: SPARK-17229 URL: https://issues.apache.org/jira/browse/SPARK-17229 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's Postgres dialect widens these types to Decimal and Integer rather than using the narrower Float and Short types.