[jira] [Commented] (SPARK-16863) ProbabilisticClassifier.fit check thresholds' length
[ https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405333#comment-15405333 ] zhengruifeng commented on SPARK-16863: -- [SPARK-16851] added a check for {{ProbabilisticClassificationModel.setThresholds}}. I just found that {{ProbabilisticClassifier.fit}} also needs to check the thresholds' length if {{ProbabilisticClassifier.setThresholds}} is called, so I opened this JIRA. > ProbabilisticClassifier.fit check thresholds' length > - > > Key: SPARK-16863 > URL: https://issues.apache.org/jira/browse/SPARK-16863 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {code} > val path = > "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > val data = spark.read.format("libsvm").load(path) > val rf = new RandomForestClassifier > rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5)) > val rfm = rf.fit(data) > rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees > rfm.numClasses > res2: Int = 3 > rfm.getThresholds > res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5) > rfm.transform(data) > java.lang.IllegalArgumentException: requirement failed: > RandomForestClassificationModel.transform() called with non-matching > numClasses and thresholds.length. numClasses=3, but thresholds has length 5 > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101) > ... 72 elided > {code} > {{ProbabilisticClassifier.fit()}} should throw an exception if its > thresholds are set incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
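The check this ticket asks for can be sketched in plain Scala. This is an illustrative sketch, not Spark's actual implementation: the object and error messages are made up, but the idea is exactly the reproducer's complaint — fit() should compare the user-supplied thresholds against the number of classes inferred from the training data and fail fast, rather than deferring the failure to transform().

```scala
// Hypothetical sketch of the validation proposed in SPARK-16863.
// Names and messages are illustrative; only the length/positivity rules
// mirror what ProbabilisticClassificationModel.transform() enforces today.
object ThresholdCheck {
  def validate(numClasses: Int, thresholds: Option[Array[Double]]): Unit = {
    thresholds.foreach { t =>
      // Fail fast in fit() instead of at transform() time.
      require(t.length == numClasses,
        s"setThresholds was given ${t.length} values, but the training data " +
        s"has $numClasses classes; the lengths must match.")
      require(t.forall(_ >= 0.0) && t.exists(_ > 0.0),
        "thresholds must be non-negative, with at least one positive value")
    }
  }
}
```

With numClasses = 3 as in the reproducer, a 5-element thresholds array would be rejected during fit() rather than surfacing later as a transform() failure.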
[jira] [Comment Edited] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300 ] Nicholas Chammas edited comment on SPARK-7146 at 8/3/16 4:45 AM: - A quick update from a PySpark user: I am using HasInputCol, HasInputCols, HasLabelCol, and HasOutputCol to create custom transformers, and I find them very handy. I know Python does not have a notion of "private" classes, but knowing these are part of the public API would be good. In summary: The updated proposal looks good to me, with the caveat that I only just started learning the new ML Pipeline API. was (Author: nchammas): A quick update from a PySpark user: I am using HasInputCol, HasInputCols, HasLabelCol, and HasOutputCol to create custom transformers, and I find them very handy. I know Python does not have a notion of "private" classes, but knowing these are part of the public API would be good. I summary: The updated proposal looks good to me, with the caveat that I only just started learning the new ML Pipeline API. > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. 
> * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. > *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier)
[jira] [Commented] (SPARK-7146) Should ML sharedParams be a public API?
[ https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405300#comment-15405300 ] Nicholas Chammas commented on SPARK-7146: - A quick update from a PySpark user: I am using HasInputCol, HasInputCols, HasLabelCol, and HasOutputCol to create custom transformers, and I find them very handy. I know Python does not have a notion of "private" classes, but knowing these are part of the public API would be good. In summary: The updated proposal looks good to me, with the caveat that I only just started learning the new ML Pipeline API. > Should ML sharedParams be a public API? > --- > > Key: SPARK-7146 > URL: https://issues.apache.org/jira/browse/SPARK-7146 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > Proposal: Make most of the Param traits in sharedParams.scala public. Mark > them as DeveloperApi. > Pros: > * Sharing the Param traits helps to encourage standardized Param names and > documentation. > Cons: > * Users have to be careful since parameters can have different meanings for > different algorithms. > * If the shared Params are public, then implementations could test for the > traits. It is unclear if we want users to rely on these traits, which are > somewhat experimental. > Currently, the shared params are private. > h3. UPDATED proposal > * Some Params are clearly safe to make public. We will do so. > * Some Params could be made public but may require caveats in the trait doc. > * Some Params have turned out not to be shared in practice. We can move > those Params to the classes which use them. 
> *Public shared params*: > * I/O column params > ** HasFeaturesCol > ** HasInputCol > ** HasInputCols > ** HasLabelCol > ** HasOutputCol > ** HasPredictionCol > ** HasProbabilityCol > ** HasRawPredictionCol > ** HasVarianceCol > ** HasWeightCol > * Algorithm settings > ** HasCheckpointInterval > ** HasElasticNetParam > ** HasFitIntercept > ** HasMaxIter > ** HasRegParam > ** HasSeed > ** HasStandardization (less common) > ** HasStepSize > ** HasTol > *Questionable params*: > * HasHandleInvalid (only used in StringIndexer, but might be more widely used > later on) > * HasSolver (used in LinearRegression and GeneralizedLinearRegression, but > same meaning as Optimizer in LDA) > *Params to be removed from sharedParams*: > * HasThreshold (only used in LogisticRegression) > * HasThresholds (only used in ProbabilisticClassifier)
[jira] [Commented] (SPARK-16863) ProbabilisticClassifier.fit check thresholds' length
[ https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405276#comment-15405276 ] Sean Owen commented on SPARK-16863: --- Wait, how is this different from SPARK-16851? I don't see why these were opened as different JIRAs. > ProbabilisticClassifier.fit check thresholds' length > - > > Key: SPARK-16863 > URL: https://issues.apache.org/jira/browse/SPARK-16863 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {code} > val path = > "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > val data = spark.read.format("libsvm").load(path) > val rf = new RandomForestClassifier > rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5)) > val rfm = rf.fit(data) > rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees > rfm.numClasses > res2: Int = 3 > rfm.getThresholds > res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5) > rfm.transform(data) > java.lang.IllegalArgumentException: requirement failed: > RandomForestClassificationModel.transform() called with non-matching > numClasses and thresholds.length. numClasses=3, but thresholds has length 5 > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101) > ... 72 elided > {code} > {{ProbabilisticClassifier.fit()}} should throw an exception if its > thresholds are set incorrectly.
[jira] [Assigned] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`
[ https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16862: Assignee: (was: Apache Spark) > Configurable buffer size in `UnsafeSorterSpillReader` > - > > Key: SPARK-16862 > URL: https://issues.apache.org/jira/browse/SPARK-16862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tejas Patil >Priority: Minor > > `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k > buffer to read data off disk. This could be made configurable to improve on > disk reads. > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53
[jira] [Commented] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`
[ https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405245#comment-15405245 ] Apache Spark commented on SPARK-16862: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/14475 > Configurable buffer size in `UnsafeSorterSpillReader` > - > > Key: SPARK-16862 > URL: https://issues.apache.org/jira/browse/SPARK-16862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tejas Patil >Priority: Minor > > `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k > buffer to read data off disk. This could be made configurable to improve on > disk reads. > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53
[jira] [Assigned] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`
[ https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16862: Assignee: Apache Spark > Configurable buffer size in `UnsafeSorterSpillReader` > - > > Key: SPARK-16862 > URL: https://issues.apache.org/jira/browse/SPARK-16862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tejas Patil >Assignee: Apache Spark >Priority: Minor > > `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k > buffer to read data off disk. This could be made configurable to improve on > disk reads. > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53
[jira] [Assigned] (SPARK-16853) Analysis error for DataSet typed selection
[ https://issues.apache.org/jira/browse/SPARK-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16853: Assignee: (was: Apache Spark) > Analysis error for DataSet typed selection > -- > > Key: SPARK-16853 > URL: https://issues.apache.org/jira/browse/SPARK-16853 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong > > For DataSet typed selection > {code} > def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1] > {code} > If U1 contains sub-fields, then it reports AnalysisException > Reproducer: > {code} > scala> case class A(a: Int, b: Int) > scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]) > org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input > columns: [_2]; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > ... > {code}
[jira] [Commented] (SPARK-16853) Analysis error for DataSet typed selection
[ https://issues.apache.org/jira/browse/SPARK-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405220#comment-15405220 ] Apache Spark commented on SPARK-16853: -- User 'clockfly' has created a pull request for this issue: https://github.com/apache/spark/pull/14474 > Analysis error for DataSet typed selection > -- > > Key: SPARK-16853 > URL: https://issues.apache.org/jira/browse/SPARK-16853 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong > > For DataSet typed selection > {code} > def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1] > {code} > If U1 contains sub-fields, then it reports AnalysisException > Reproducer: > {code} > scala> case class A(a: Int, b: Int) > scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]) > org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input > columns: [_2]; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > ... > {code}
[jira] [Assigned] (SPARK-16853) Analysis error for DataSet typed selection
[ https://issues.apache.org/jira/browse/SPARK-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16853: Assignee: Apache Spark > Analysis error for DataSet typed selection > -- > > Key: SPARK-16853 > URL: https://issues.apache.org/jira/browse/SPARK-16853 > Project: Spark > Issue Type: Bug >Reporter: Sean Zhong >Assignee: Apache Spark > > For DataSet typed selection > {code} > def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1] > {code} > If U1 contains sub-fields, then it reports AnalysisException > Reproducer: > {code} > scala> case class A(a: Int, b: Int) > scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]) > org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input > columns: [_2]; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > ... > {code}
[jira] [Commented] (SPARK-16495) Add ADMM optimizer in mllib package
[ https://issues.apache.org/jira/browse/SPARK-16495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405217#comment-15405217 ] Apache Spark commented on SPARK-16495: -- User 'ZunwenYou' has created a pull request for this issue: https://github.com/apache/spark/pull/14473 > Add ADMM optimizer in mllib package > --- > > Key: SPARK-16495 > URL: https://issues.apache.org/jira/browse/SPARK-16495 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: zunwen you > > Alternating Direction Method of Multipliers (ADMM) is well suited to > distributed convex optimization, and in particular to large-scale problems > arising in statistics, machine learning, and related areas. > Details can be found in [S. Boyd's > paper](http://www.stanford.edu/~boyd/papers/admm_distr_stats.html).
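For readers who haven't seen the cited Boyd et al. monograph, the algorithm the ticket proposes has a compact standard form. The updates below are the usual scaled-form ADMM iterations for minimizing f(x) + g(z) subject to Ax + Bz = c; they are quoted from the standard presentation, not from the linked pull request.

```latex
% Scaled-form ADMM updates (rho > 0 is the penalty parameter,
% u is the scaled dual variable):
x^{k+1} = \arg\min_x \; f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^{k} - c + u^{k}\rVert_2^2
z^{k+1} = \arg\min_z \; g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^{k}\rVert_2^2
u^{k+1} = u^{k} + Ax^{k+1} + Bz^{k+1} - c
```

The appeal for Spark is that the x-update often decomposes across data partitions, with only the z- and u-updates requiring aggregation, which maps naturally onto a map/reduce-style cluster.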
[jira] [Assigned] (SPARK-16866) Basic infrastructure for file-based SQL end-to-end tests
[ https://issues.apache.org/jira/browse/SPARK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16866: Assignee: Apache Spark > Basic infrastructure for file-based SQL end-to-end tests > > > Key: SPARK-16866 > URL: https://issues.apache.org/jira/browse/SPARK-16866 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Peter Lee >Assignee: Apache Spark >
[jira] [Commented] (SPARK-16866) Basic infrastructure for file-based SQL end-to-end tests
[ https://issues.apache.org/jira/browse/SPARK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405203#comment-15405203 ] Apache Spark commented on SPARK-16866: -- User 'petermaxlee' has created a pull request for this issue: https://github.com/apache/spark/pull/14472 > Basic infrastructure for file-based SQL end-to-end tests > > > Key: SPARK-16866 > URL: https://issues.apache.org/jira/browse/SPARK-16866 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Peter Lee >
[jira] [Assigned] (SPARK-16866) Basic infrastructure for file-based SQL end-to-end tests
[ https://issues.apache.org/jira/browse/SPARK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16866: Assignee: (was: Apache Spark) > Basic infrastructure for file-based SQL end-to-end tests > > > Key: SPARK-16866 > URL: https://issues.apache.org/jira/browse/SPARK-16866 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Peter Lee >
[jira] [Created] (SPARK-16866) Basic infrastructure for file-based SQL end-to-end tests
Peter Lee created SPARK-16866: - Summary: Basic infrastructure for file-based SQL end-to-end tests Key: SPARK-16866 URL: https://issues.apache.org/jira/browse/SPARK-16866 Project: Spark Issue Type: Sub-task Reporter: Peter Lee
[jira] [Created] (SPARK-16865) A file-based end-to-end SQL query suite
Peter Lee created SPARK-16865: - Summary: A file-based end-to-end SQL query suite Key: SPARK-16865 URL: https://issues.apache.org/jira/browse/SPARK-16865 Project: Spark Issue Type: Improvement Components: SQL Reporter: Peter Lee Spark currently has a large number of end-to-end SQL test cases in various SQLQuerySuites. It is fairly difficult to manage and operate these end-to-end test cases. This ticket proposes that we use files to specify queries and expected outputs.
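The proposal in SPARK-16865 — queries and expected outputs as paired files — can be sketched in a few lines of plain Scala. This is only an illustration of the idea, not Spark's eventual SQLQueryTestSuite; the file naming and the `runQuery` hook are made up for the sketch.

```scala
import java.nio.file.{Files, Path}

// Minimal sketch of a file-based SQL test harness: each test case is a pair
// of files (query text, expected output text), and the suite diffs the
// actual rendered result against the expected file.
object FileBasedSqlTest {
  // `runQuery` stands in for whatever executes the SQL and renders rows
  // as text (in Spark it would go through a SparkSession).
  def check(queryFile: Path, expectedFile: Path,
            runQuery: String => String): Boolean = {
    val sql      = new String(Files.readAllBytes(queryFile), "UTF-8").trim
    val expected = new String(Files.readAllBytes(expectedFile), "UTF-8").trim
    runQuery(sql) == expected
  }
}
```

A nice property of this layout is that regenerating expected outputs after an intentional behavior change is a mechanical re-run, instead of hand-editing Scala assertions across many SQLQuerySuites.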
[jira] [Commented] (SPARK-14387) Exceptions thrown when querying ORC tables
[ https://issues.apache.org/jira/browse/SPARK-14387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405192#comment-15405192 ] Apache Spark commented on SPARK-14387: -- User 'rajeshbalamohan' has created a pull request for this issue: https://github.com/apache/spark/pull/14471 > Exceptions thrown when querying ORC tables > -- > > Key: SPARK-14387 > URL: https://issues.apache.org/jira/browse/SPARK-14387 > Project: Spark > Issue Type: Bug >Reporter: Rajesh Balamohan > > In master branch, I tried to run TPC-DS queries (e.g Query27) at 200 GB > scale. Initially I got the following exception (as FileScanRDD has been made > the default in master branch) > {noformat} > 16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0. > java.lang.IllegalArgumentException: Field "s_store_sk" does not exist. > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:59) > at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) 
> at > org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410) > at > org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157) > at > org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361) > {noformat} > When running with "spark.sql.sources.fileScan=false", following exception is > thrown > {noformat} > 16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist. 
> at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:59) > at > org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410) > at > org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317) > at > org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124) > at >
[jira] [Created] (SPARK-16864) Comprehensive version info
jay vyas created SPARK-16864: Summary: Comprehensive version info Key: SPARK-16864 URL: https://issues.apache.org/jira/browse/SPARK-16864 Project: Spark Issue Type: Improvement Reporter: jay vyas Spark versions can be grepped out of the Spark banner that comes up on startup, but otherwise, there is no programmatic/reliable way to get version information. Also there is no git commit id, etc. So precise version checking isn't possible.
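One natural shape for the richer metadata this ticket asks for is a small properties resource generated at build time and parsed at runtime. The sketch below is illustrative only — the property names (`version`, `revision`, `branch`) and the idea of a bundled properties file are assumptions, not a description of what Spark actually ships.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical sketch: build-time version metadata stored as a properties
// resource, parsed into an immutable Scala Map for programmatic checks.
object VersionInfo {
  def parse(text: String): Map[String, String] = {
    val props = new Properties()
    props.load(new StringReader(text))
    var m = Map.empty[String, String]
    val it = props.stringPropertyNames().iterator()
    while (it.hasNext) {
      val k = it.next()
      m += (k -> props.getProperty(k))
    }
    m
  }
}
```

Callers could then assert on an exact git revision instead of grepping the startup banner, which is the "precise version checking" the ticket says is currently impossible.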
[jira] [Assigned] (SPARK-16863) ProbabilisticClassifier.fit check thresholds' length
[ https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16863: Assignee: Apache Spark > ProbabilisticClassifier.fit check thresholds' length > - > > Key: SPARK-16863 > URL: https://issues.apache.org/jira/browse/SPARK-16863 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > {code} > val path = > "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > val data = spark.read.format("libsvm").load(path) > val rf = new RandomForestClassifier > rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5)) > val rfm = rf.fit(data) > rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees > rfm.numClasses > res2: Int = 3 > rfm.getThresholds > res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5) > rfm.transform(data) > java.lang.IllegalArgumentException: requirement failed: > RandomForestClassificationModel.transform() called with non-matching > numClasses and thresholds.length. numClasses=3, but thresholds has length 5 > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101) > ... 72 elided > {code} > {{ProbabilisticClassifier.fit()}} should throw an exception if its > thresholds are set incorrectly.
[jira] [Commented] (SPARK-16863) ProbabilisticClassifier.fit check thresholds' length
[ https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405182#comment-15405182 ] Apache Spark commented on SPARK-16863: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/14470 > ProbabilisticClassifier.fit check thresholds' length > - > > Key: SPARK-16863 > URL: https://issues.apache.org/jira/browse/SPARK-16863 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {code} > val path = > "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > val data = spark.read.format("libsvm").load(path) > val rf = new RandomForestClassifier > rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5)) > val rfm = rf.fit(data) > rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees > rfm.numClasses > res2: Int = 3 > rfm.getThresholds > res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5) > rfm.transform(data) > java.lang.IllegalArgumentException: requirement failed: > RandomForestClassificationModel.transform() called with non-matching > numClasses and thresholds.length. numClasses=3, but thresholds has length 5 > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101) > ... 72 elided > {code} > {{ProbabilisticClassifier.fit()}} should throw an exception if its > thresholds are set incorrectly.
[jira] [Assigned] (SPARK-16863) ProbabilisticClassifier.fit check thresholds' length
[ https://issues.apache.org/jira/browse/SPARK-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16863: Assignee: (was: Apache Spark) > ProbabilisticClassifier.fit check thresholds' length > - > > Key: SPARK-16863 > URL: https://issues.apache.org/jira/browse/SPARK-16863 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Minor > > {code} > val path = > "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > val data = spark.read.format("libsvm").load(path) > val rf = new RandomForestClassifier > rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5)) > val rfm = rf.fit(data) > rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees > rfm.numClasses > res2: Int = 3 > rfm.getThresholds > res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5) > rfm.transform(data) > java.lang.IllegalArgumentException: requirement failed: > RandomForestClassificationModel.transform() called with non-matching > numClasses and thresholds.length. numClasses=3, but thresholds has length 5 > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101) > ... 72 elided > {code} > {{ProbabilisticClassifier.fit()}} should throw an exception if its > thresholds are set incorrectly.
[jira] [Created] (SPARK-16863) ProbabilisticClassifier.fit check thresholds' length
zhengruifeng created SPARK-16863: Summary: ProbabilisticClassifier.fit check thresholds' length Key: SPARK-16863 URL: https://issues.apache.org/jira/browse/SPARK-16863 Project: Spark Issue Type: Improvement Components: ML Reporter: zhengruifeng Priority: Minor {code} val path = "./spark-2.0.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" val data = spark.read.format("libsvm").load(path) val rf = new RandomForestClassifier rf.setThresholds(Array(0.1,0.2,0.3,0.4,0.5)) val rfm = rf.fit(data) rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = RandomForestClassificationModel (uid=rfc_fec31a5b954d) with 20 trees rfm.numClasses res2: Int = 3 rfm.getThresholds res3: Array[Double] = Array(0.1, 0.2, 0.3, 0.4, 0.5) rfm.transform(data) java.lang.IllegalArgumentException: requirement failed: RandomForestClassificationModel.transform() called with non-matching numClasses and thresholds.length. numClasses=3, but thresholds has length 5 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:101) ... 72 elided {code} {{ProbabilisticClassifier.fit()}} should throw an exception if its thresholds are set incorrectly.
[jira] [Created] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`
Tejas Patil created SPARK-16862: --- Summary: Configurable buffer size in `UnsafeSorterSpillReader` Key: SPARK-16862 URL: https://issues.apache.org/jira/browse/SPARK-16862 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.0.0 Reporter: Tejas Patil Priority: Minor `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This could be made configurable to improve on disk reads. https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
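As a rough illustration of the change being proposed (a Python stand-in with invented names; the real fix would be in the Java constructor of {{UnsafeSorterSpillReader}}):

```python
import io
import os
import tempfile

# Illustrative sketch only: wrap a raw stream in a buffered reader whose
# buffer size is configurable, instead of the fixed 8k default that
# BufferedInputStream gives the spill reader today.
DEFAULT_BUFFER_SIZE = 8 * 1024  # the hard-coded default this issue wants configurable

def open_spill_reader(path, buffer_size=None):
    size = buffer_size if buffer_size is not None else DEFAULT_BUFFER_SIZE
    raw = open(path, "rb", buffering=0)  # unbuffered raw stream
    return io.BufferedReader(raw, buffer_size=size)

# A larger buffer can reduce the number of disk reads for big sequential spills.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 100)
    path = f.name
reader = open_spill_reader(path, buffer_size=1024 * 1024)  # e.g. 1 MB
data = reader.read()
reader.close()
os.remove(path)
```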
[jira] [Commented] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405112#comment-15405112 ] Saisai Shao commented on SPARK-14453: - If you want to fix this issue, it would be better to target SPARK-12344. > Remove SPARK_JAVA_OPTS environment variable > --- > > Key: SPARK-14453 > URL: https://issues.apache.org/jira/browse/SPARK-14453 > Project: Spark > Issue Type: Task > Components: Spark Core, YARN >Reporter: Saisai Shao >Priority: Minor > > SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of a major version > (2.0), I think it would be better to remove support for this env variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16610) When writing ORC files, orc.compress should not be overridden if users do not set "compression" in the options
[ https://issues.apache.org/jira/browse/SPARK-16610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405071#comment-15405071 ] Hyukjin Kwon commented on SPARK-16610: -- One thought: if that is the case, we might have to officially document that we no longer respect the Hadoop configuration. > When writing ORC files, orc.compress should not be overridden if users do not > set "compression" in the options > -- > > Key: SPARK-16610 > URL: https://issues.apache.org/jira/browse/SPARK-16610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai > > For the ORC source, Spark SQL has a writer option {{compression}}, which is used > to set the codec, and its value will also be set to orc.compress (the ORC conf > used for the codec). However, if a user only sets {{orc.compress}} in the writer > options, we should not use the default value of "compression" (snappy) as the > codec. Instead, we should respect the value of {{orc.compress}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
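The precedence the issue asks for can be sketched as follows (a hypothetical helper, not Spark's actual resolution code): the writer's {{compression}} option wins if set, an explicit {{orc.compress}} is respected next, and only then does the "snappy" default apply.

```python
# Invented function for illustration; option names are the ones discussed above.
def resolve_orc_codec(options):
    if "compression" in options:
        return options["compression"]   # explicit writer option always wins
    if "orc.compress" in options:
        return options["orc.compress"]  # respect the ORC conf if the user set it
    return "snappy"                     # default only when neither is set

print(resolve_orc_codec({"orc.compress": "ZLIB"}))  # ZLIB
print(resolve_orc_codec({"compression": "zlib", "orc.compress": "SNAPPY"}))  # zlib
print(resolve_orc_codec({}))  # snappy
```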
[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405069#comment-15405069 ] Xusen Yin commented on SPARK-16857: --- I agree the cluster assignments could be arbitrary. Yes, under this condition we shouldn't use MulticlassClassificationEvaluator to evaluate the result. > CrossValidator and KMeans throws IllegalArgumentException > - > > Key: SPARK-16857 > URL: https://issues.apache.org/jira/browse/SPARK-16857 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: spark-jobserver docker image. Spark 1.6.1 on ubuntu, > Hadoop 2.4 >Reporter: Ryan Claussen > > I am attempting to use CrossValidation to train KMeans model. When I attempt > to fit the data spark throws an IllegalArgumentException as below since the > KMeans algorithm outputs an Integer into the prediction column instead of a > Double. Before I go too far: is using CrossValidation with Kmeans > supported? > Here's the exception: > {quote} > java.lang.IllegalArgumentException: requirement failed: Column prediction > must be of type DoubleType but was actually IntegerType.
> at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39) > at > spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > Here is the code I'm using to set up my cross validator. As the stack trace > above indicates it is failing at the fit step when > {quote} > ... 
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures") > val labelConverter = new > IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels) > val pipeline = new Pipeline().setStages(Array(labelIndexer, > featureIndexer, mpc, labelConverter)) > val evaluator = new > MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction") > val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, > 200, 500)).build() > val cv = new > CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3) > val cvModel = cv.fit(trainingData) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405057#comment-15405057 ] Sean Zhong commented on SPARK-16320: [~maver1ck] Did you use the test case in this JIRA {code} select count(*) where nested_column.id > some_id {code} or the test case in SPARK-16321? {code} df = sqlctx.read.parquet(path) df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 else []).collect() {code} There is a slight difference. In the first test case the filter condition uses nested fields, for which filter pushdown is not supported, so your PR 14465 should not be effective there. Basically, SPARK-16320 and SPARK-16321 are different problems. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem.
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
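The gap in query 2 is consistent with the point made in the comment above: nested fields fall outside Parquet filter pushdown. A deliberately simplified model of that distinction (not Spark's planner logic; the function is invented for illustration):

```python
# Top-level column predicates can be pushed to the Parquet reader; predicates
# on nested fields like "column1.column5.column50" must be evaluated after
# the rows are read, which is where the extra time goes.
def can_push_down(column_ref):
    """Nested references ("a.b.c") cannot be pushed to the Parquet reader."""
    return "." not in column_ref

print(can_push_down("id"))                        # True: flat column
print(can_push_down("column1.column5.column50"))  # False: nested field
```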
[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405055#comment-15405055 ] Sean Owen commented on SPARK-16857: --- I don't think that in general it makes sense to use MulticlassClassificationEvaluator with k-means, since there is in general no known correct 'label' (cluster assignment). If you do happen to have these assignments, yes you can do something like the above, but I might expect you do have to do a little extra work like convert the integer cluster assignment into a double. That's arguably as it should be, I think. (Keep in mind that cluster numbering is arbitrary; are you sure this will be valid to compare against a fixed set of cluster assignments?) There are other k-means evaluation metrics, but they aren't evaluations on the cluster assignment itself, but things like intra-cluster distance. > CrossValidator and KMeans throws IllegalArgumentException > - > > Key: SPARK-16857 > URL: https://issues.apache.org/jira/browse/SPARK-16857 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: spark-jobserver docker image. Spark 1.6.1 on ubuntu, > Hadoop 2.4 >Reporter: Ryan Claussen > > I am attempting to use CrossValidation to train KMeans model. When I attempt > to fit the data spark throws an IllegalArgumentException as below since the > KMeans algorithm outputs an Integer into the prediction column instead of a > Double. Before I go too far: is using CrossValidation with Kmeans > supported? > Here's the exception: > {quote} > java.lang.IllegalArgumentException: requirement failed: Column prediction > must be of type DoubleType but was actually IntegerType. 
> at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39) > at > spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > Here is the code I'm using to set up my cross validator. As the stack trace > above indicates it is failing at the fit step when > {quote} > ... 
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures") > val labelConverter = new > IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels) > val pipeline = new Pipeline().setStages(Array(labelIndexer, > featureIndexer, mpc, labelConverter)) > val evaluator = new > MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction") > val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, > 200, 500)).build() > val cv = new > CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3) > val cvModel = cv.fit(trainingData) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
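The "little extra work" mentioned in the comment above — converting the integer cluster assignment to a double before evaluation — might look like this plain-Python stand-in (with DataFrames one would instead cast the prediction column to DoubleType; this is only an illustration of the conversion):

```python
# MulticlassClassificationEvaluator requires a DoubleType prediction column,
# while KMeans emits integer cluster ids; widen the values before evaluating.
def to_double_predictions(predictions):
    return [float(p) for p in predictions]

preds = to_double_predictions([0, 1, 1, 0, 2])
print(preds)  # [0.0, 1.0, 1.0, 0.0, 2.0]
```

Note this only fixes the type error; as discussed above, comparing arbitrary cluster numbers against fixed labels may still not be meaningful.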
[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405050#comment-15405050 ] Xusen Yin commented on SPARK-16857: --- Using CrossValidator with KMeans should be supported. As a kind of external evaluation for KMeans, I think using MulticlassClassificationEvaluator with KMeans should also be supported. Why not send a PR since it would be a quick fix. CC [~yanboliang] > CrossValidator and KMeans throws IllegalArgumentException > - > > Key: SPARK-16857 > URL: https://issues.apache.org/jira/browse/SPARK-16857 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: spark-jobserver docker image. Spark 1.6.1 on ubuntu, > Hadoop 2.4 >Reporter: Ryan Claussen > > I am attempting to use CrossValidation to train KMeans model. When I attempt > to fit the data spark throws an IllegalArgumentException as below since the > KMeans algorithm outputs an Integer into the prediction column instead of a > Double. Before I go too far: is using CrossValidation with Kmeans > supported? > Here's the exception: > {quote} > java.lang.IllegalArgumentException: requirement failed: Column prediction > must be of type DoubleType but was actually IntegerType. 
> at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39) > at > spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > Here is the code I'm using to set up my cross validator. As the stack trace > above indicates it is failing at the fit step when > {quote} > ... 
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures") > val labelConverter = new > IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels) > val pipeline = new Pipeline().setStages(Array(labelIndexer, > featureIndexer, mpc, labelConverter)) > val evaluator = new > MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction") > val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, > 200, 500)).build() > val cv = new > CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3) > val cvModel = cv.fit(trainingData) > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405044#comment-15405044 ] Apache Spark commented on SPARK-16700: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/14469 > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16700: Assignee: (was: Apache Spark) > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405043#comment-15405043 ] Davies Liu commented on SPARK-16700: Sent PR https://github.com/apache/spark/pull/14469 to address this; could you help test and review it? > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python {{dict}} type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16700: Assignee: Apache Spark > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Apache Spark > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-16700: -- Assignee: Davies Liu > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Davies Liu > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15541) SparkContext.stop throws error
[ https://issues.apache.org/jira/browse/SPARK-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15541: -- Fix Version/s: 2.1.0 2.0.1 1.6.3 > SparkContext.stop throws error > -- > > Key: SPARK-15541 > URL: https://issues.apache.org/jira/browse/SPARK-15541 > Project: Spark > Issue Type: Bug >Reporter: Miao Wang >Assignee: Maciej Bryński > Fix For: 1.6.3, 2.0.1, 2.1.0 > > > When running unit-tests or examples from command line or Intellij, > SparkContext throws errors. > For example: > ./bin/run-example ml.JavaNaiveBayesExample > Exception in thread "main" 16/05/25 15:17:55 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > java.lang.NoSuchMethodError: > java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; > at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:176) > at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:291) > at > org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:269) > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:91) > at > org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1796) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1795) > at org.apache.spark.sql.SparkSession.stop(SparkSession.scala:577) > at > org.apache.spark.examples.ml.JavaNaiveBayesExample.main(JavaNaiveBayesExample.java:61) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 16/05/25 15:17:55 INFO ShutdownHookManager: Shutdown hook called -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16671) Merge variable substitution code in core and SQL
[ https://issues.apache.org/jira/browse/SPARK-16671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16671: Assignee: (was: Apache Spark) > Merge variable substitution code in core and SQL > > > Key: SPARK-16671 > URL: https://issues.apache.org/jira/browse/SPARK-16671 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Priority: Minor > > SPARK-16272 added support for variable substitution in configs in the core > Spark configuration. That code has a lot of similarities with SQL's > {{VariableSubstitution}}, and we should make both use the same code as much > as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16671) Merge variable substitution code in core and SQL
[ https://issues.apache.org/jira/browse/SPARK-16671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16671: Assignee: Apache Spark > Merge variable substitution code in core and SQL > > > Key: SPARK-16671 > URL: https://issues.apache.org/jira/browse/SPARK-16671 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > SPARK-16272 added support for variable substitution in configs in the core > Spark configuration. That code has a lot of similarities with SQL's > {{VariableSubstitution}}, and we should make both use the same code as much > as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16671) Merge variable substitution code in core and SQL
[ https://issues.apache.org/jira/browse/SPARK-16671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405022#comment-15405022 ] Apache Spark commented on SPARK-16671: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/14468 > Merge variable substitution code in core and SQL > > > Key: SPARK-16671 > URL: https://issues.apache.org/jira/browse/SPARK-16671 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Priority: Minor > > SPARK-16272 added support for variable substitution in configs in the core > Spark configuration. That code has a lot of similarities with SQL's > {{VariableSubstitution}}, and we should make both use the same code as much > as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
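For readers unfamiliar with the two implementations, the shared behavior is roughly this kind of {{${key}}} expansion (assumed syntax, illustrative only — this is the code of neither SparkConf's substitution nor SQL's {{VariableSubstitution}}):

```python
import re

# Minimal sketch of config variable substitution: replace ${key} references
# with values from a config map, leaving unknown references intact.
def substitute(value, conf):
    def repl(m):
        key = m.group(1)
        return conf.get(key, m.group(0))  # unknown keys stay as-is
    return re.sub(r"\$\{([^}]+)\}", repl, value)

conf = {"spark.master": "local[4]"}
print(substitute("master is ${spark.master}", conf))  # master is local[4]
print(substitute("missing ${spark.unknown}", conf))   # missing ${spark.unknown}
```

Merging the two implementations would mean both config systems agree on details like this unknown-key behavior.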
[jira] [Resolved] (SPARK-16796) Visible passwords on Spark environment page
[ https://issues.apache.org/jira/browse/SPARK-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16796. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14409 [https://github.com/apache/spark/pull/14409] > Visible passwords on Spark environment page > --- > > Key: SPARK-16796 > URL: https://issues.apache.org/jira/browse/SPARK-16796 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Artur >Assignee: Artur >Priority: Trivial > Fix For: 2.1.0 > > Attachments: > Mask_spark_ssl_keyPassword_spark_ssl_keyStorePassword_spark_ssl_trustStorePassword_from_We1.patch > > > Spark properties (passwords): > spark.ssl.keyPassword,spark.ssl.keyStorePassword,spark.ssl.trustStorePassword > are visible in Web UI in environment page. > Can we mask them from Web UI? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
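The fix amounts to masking values for sensitive-looking keys before the environment page renders them. A minimal sketch of that approach (illustrative only — the key list and helper name are assumptions, not Spark's implementation, and later Spark versions drive this from a configurable regex):

```python
SENSITIVE_FRAGMENTS = ("password", "secret", "token")  # illustrative list

def redact_conf(conf):
    """Return a copy of a Spark-style config dict with values of
    likely-sensitive keys replaced by a fixed mask before display."""
    return {
        key: "*********" if any(f in key.lower() for f in SENSITIVE_FRAGMENTS)
        else value
        for key, value in conf.items()
    }
```

Matching on key names rather than values means `spark.ssl.keyPassword` and friends are masked even when their values look innocuous.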
[jira] [Assigned] (SPARK-16861) Refactor PySpark accumulator API to be on top of AccumulatorV2 API
[ https://issues.apache.org/jira/browse/SPARK-16861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16861: Assignee: (was: Apache Spark) > Refactor PySpark accumulator API to be on top of AccumulatorV2 API > -- > > Key: SPARK-16861 > URL: https://issues.apache.org/jira/browse/SPARK-16861 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: holdenk > > The PySpark accumulator API is implemented on top of the legacy accumulator > API which is now deprecated. To allow this deprecated API to be removed > eventually we will need to rewrite the PySpark accumulators on top of the new > API. An open question for a possible follow up is if we want to make the > Python accumulator API look more like the new accumulator API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16861) Refactor PySpark accumulator API to be on top of AccumulatorV2 API
[ https://issues.apache.org/jira/browse/SPARK-16861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404891#comment-15404891 ] Apache Spark commented on SPARK-16861: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/14467 > Refactor PySpark accumulator API to be on top of AccumulatorV2 API > -- > > Key: SPARK-16861 > URL: https://issues.apache.org/jira/browse/SPARK-16861 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: holdenk > > The PySpark accumulator API is implemented on top of the legacy accumulator > API which is now deprecated. To allow this deprecated API to be removed > eventually we will need to rewrite the PySpark accumulators on top of the new > API. An open question for a possible follow up is if we want to make the > Python accumulator API look more like the new accumulator API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16861) Refactor PySpark accumulator API to be on top of AccumulatorV2 API
[ https://issues.apache.org/jira/browse/SPARK-16861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16861: Assignee: Apache Spark > Refactor PySpark accumulator API to be on top of AccumulatorV2 API > -- > > Key: SPARK-16861 > URL: https://issues.apache.org/jira/browse/SPARK-16861 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: holdenk >Assignee: Apache Spark > > The PySpark accumulator API is implemented on top of the legacy accumulator > API which is now deprecated. To allow this deprecated API to be removed > eventually we will need to rewrite the PySpark accumulators on top of the new > API. An open question for a possible follow up is if we want to make the > Python accumulator API look more like the new accumulator API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16861) Refactor PySpark accumulator API to be on top of AccumulatorV2 API
holdenk created SPARK-16861: --- Summary: Refactor PySpark accumulator API to be on top of AccumulatorV2 API Key: SPARK-16861 URL: https://issues.apache.org/jira/browse/SPARK-16861 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Reporter: holdenk The PySpark accumulator API is implemented on top of the legacy accumulator API which is now deprecated. To allow this deprecated API to be removed eventually we will need to rewrite the PySpark accumulators on top of the new API. An open question for a possible follow up is if we want to make the Python accumulator API look more like the new accumulator API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
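For context on the refactor, the new JVM-side contract differs from the legacy one in shape: instead of a single `zero`/`addInPlace` pair, AccumulatorV2 splits the lifecycle into reset/add/merge/value. A toy Python transposition of that contract (method names mirror the JVM API for illustration; PySpark's real classes differ):

```python
class ListAccumulator:
    """Toy accumulator following the AccumulatorV2-style contract
    (reset/add/merge/value) rather than the legacy zero/addInPlace pair."""

    def __init__(self):
        self._items = []

    def is_zero(self):
        return not self._items

    def reset(self):
        self._items = []

    def add(self, v):        # on executors: accumulate one element
        self._items.append(v)

    def merge(self, other):  # on the driver: combine partial results
        self._items.extend(other._items)

    @property
    def value(self):
        return list(self._items)
```

Rewriting PySpark's accumulators on top of this shape is what would let the deprecated legacy API be removed.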
[jira] [Resolved] (SPARK-16858) Removal of TestHiveSharedState
[ https://issues.apache.org/jira/browse/SPARK-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16858. - Resolution: Fixed Assignee: Xiao Li Fix Version/s: 2.1.0 > Removal of TestHiveSharedState > -- > > Key: SPARK-16858 > URL: https://issues.apache.org/jira/browse/SPARK-16858 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Remove TestHiveSharedState. Otherwise, we are not really testing the > reflection logic based on the setting of CATALOG_IMPLEMENTATION. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404676#comment-15404676 ] Apache Spark commented on SPARK-16321: -- User 'maver1ck' has created a pull request for this issue: https://github.com/apache/spark/pull/14465 > Spark 2.0 performance drop vs Spark 1.6 when reading parquet file > - > > Key: SPARK-16321 > URL: https://issues.apache.org/jira/browse/SPARK-16321 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, > spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, > spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, > visualvm_spark2_G1GC.png > > > *UPDATE* > Please start with this comment > https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785 > I assume that problem results from the performance problem with reading > parquet files > *Original Issue description* > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is 2x slower. > {code} > df = sqlctx.read.parquet(path) > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 > else []).collect() > {code} > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16321: Assignee: (was: Apache Spark) > Spark 2.0 performance drop vs Spark 1.6 when reading parquet file > - > > Key: SPARK-16321 > URL: https://issues.apache.org/jira/browse/SPARK-16321 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, > spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, > spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, > visualvm_spark2_G1GC.png > > > *UPDATE* > Please start with this comment > https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785 > I assume that problem results from the performance problem with reading > parquet files > *Original Issue description* > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is 2x slower. > {code} > df = sqlctx.read.parquet(path) > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 > else []).collect() > {code} > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16321: Assignee: Apache Spark > Spark 2.0 performance drop vs Spark 1.6 when reading parquet file > - > > Key: SPARK-16321 > URL: https://issues.apache.org/jira/browse/SPARK-16321 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Apache Spark >Priority: Critical > Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, > spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, > spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, > visualvm_spark2_G1GC.png > > > *UPDATE* > Please start with this comment > https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785 > I assume that problem results from the performance problem with reading > parquet files > *Original Issue description* > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is 2x slower. > {code} > df = sqlctx.read.parquet(path) > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 > else []).collect() > {code} > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404675#comment-15404675 ] Apache Spark commented on SPARK-16320: -- User 'maver1ck' has created a pull request for this issue: https://github.com/apache/spark/pull/14465 > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404654#comment-15404654 ] Maciej Bryński commented on SPARK-16321: [~smilegator] spark.sql.parquet.filterPushdown has true as a default. Vectorized Reader isn't a case here because I have nested columns (and Vectorized Reader works only with Atomic Types) > Spark 2.0 performance drop vs Spark 1.6 when reading parquet file > - > > Key: SPARK-16321 > URL: https://issues.apache.org/jira/browse/SPARK-16321 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, > spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, > spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, > visualvm_spark2_G1GC.png > > > *UPDATE* > Please start with this comment > https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785 > I assume that problem results from the performance problem with reading > parquet files > *Original Issue description* > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is 2x slower. > {code} > df = sqlctx.read.parquet(path) > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 > else []).collect() > {code} > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16860) UDT Stringification Incorrect in PySpark
Vladimir Feinberg created SPARK-16860: - Summary: UDT Stringification Incorrect in PySpark Key: SPARK-16860 URL: https://issues.apache.org/jira/browse/SPARK-16860 Project: Spark Issue Type: Bug Components: PySpark Reporter: Vladimir Feinberg Priority: Minor When using `show()` on a `DataFrame` containing a UDT, Spark doesn't call the appropriate `__str__` method for display. Example: https://gist.github.com/vlad17/baa8e18ed724c4d88436a92ca159dd5b -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
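The mismatch the report describes can be reproduced outside Spark: Python containers render their elements via `__repr__`, so a class that only customizes `__str__` loses its custom rendering once it sits inside a Row-like structure. A toy stand-in (hypothetical `Point` class, not Spark's UDT machinery):

```python
class Point:
    """__str__ is customized but __repr__ is not, so rendering the value
    inside a container (as a DataFrame cell effectively is) bypasses it."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __str__(self):
        return "Point({}, {})".format(self.x, self.y)

p = Point(1, 2)
print(str(p))    # the bare object uses __str__
print(str([p]))  # the list calls __repr__ on elements, so __str__ is bypassed
```

Defining `__repr__` (or aliasing it to `__str__`) on the UDT value class is the usual workaround for this class of problem.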
[jira] [Commented] (SPARK-16321) Spark 2.0 performance drop vs Spark 1.6 when reading parquet file
[ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404652#comment-15404652 ] Xiao Li commented on SPARK-16321: - Can you set `spark.sql.parquet.enableVectorizedReader` to false and `spark.sql.parquet.filterPushdown` to true? BTW, keep the other settings unchanged. > Spark 2.0 performance drop vs Spark 1.6 when reading parquet file > - > > Key: SPARK-16321 > URL: https://issues.apache.org/jira/browse/SPARK-16321 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: Spark16.nps, Spark2.nps, spark16._trace.png, > spark16_query.nps, spark2_nofilterpushdown.nps, spark2_query.nps, > spark2_trace.png, visualvm_spark16.png, visualvm_spark2.png, > visualvm_spark2_G1GC.png > > > *UPDATE* > Please start with this comment > https://issues.apache.org/jira/browse/SPARK-16321?focusedCommentId=15383785=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15383785 > I assume that problem results from the performance problem with reading > parquet files > *Original Issue description* > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is 2x slower. > {code} > df = sqlctx.read.parquet(path) > df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 > else []).collect() > {code} > Spark 1.6 -> 2.3 min > Spark 2.0 -> 4.6 min (2x slower) > I used BasicProfiler for this task and cumulative time was: > Spark 1.6 - 4300 sec > Spark 2.0 - 5800 sec > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404646#comment-15404646 ] Maciej Bryński edited comment on SPARK-16320 at 8/2/16 7:38 PM: [~michael], [~yhuai], [~rxin] I think this is smallest change that resolves my problem. {code} diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index f1c78bb..2b84885 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat None } +pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _)) + val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) {code} Can anybody tell why it works ? was (Author: maver1ck): [~michael], [~yhuai] I think this is smallest change that resolves my problem. {code} diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index f1c78bb..2b84885 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat None } +pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _)) + val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) {code} Can anybody tell why it works ? 
> Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > 
df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. >
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404646#comment-15404646 ] Maciej Bryński edited comment on SPARK-16320 at 8/2/16 7:37 PM: [~michael], [~yhuai] I think this is smallest change that resolves my problem. {code} diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index f1c78bb..2b84885 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat None } +pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _)) + val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) {code} Can anybody tell why it works ? was (Author: maver1ck): [~michael], [~yhuai] I think this is smallest change that resolves my problem. {code} diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index f1c78bb..2b84885 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat None } +pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _)) + val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) {code} Is anybody can tell why it works ? 
> Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. > {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > 
df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. >
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404646#comment-15404646 ] Maciej Bryński commented on SPARK-16320: [~michael], [~yhuai] I think this is the smallest change that resolves my problem. {code} diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala index f1c78bb..2b84885 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala @@ -323,6 +323,8 @@ private[sql] class ParquetFileFormat None } +pushed.foreach(ParquetInputFormat.setFilterPredicate(hadoopConf, _)) + val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) {code} Can anybody tell why it works ? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
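As to why the one-line patch helps: setting the pushed filter on the Hadoop conf lets the Parquet reader evaluate the predicate during the scan, so whole row groups whose column statistics cannot satisfy it are skipped before decoding. A toy model of that pruning (a stand-in with fabricated (min, max) statistics, not the Parquet API):

```python
def scan_greater_than(row_groups, threshold):
    """Model of Parquet row-group pruning for a 'value > threshold' filter.

    Each group carries (min, max) statistics plus its rows; any group whose
    max cannot exceed the threshold is skipped without decoding its rows.
    Without a pushed-down predicate, every row is decoded and filtered later.
    """
    matched = []
    for g_min, g_max, rows in row_groups:
        if g_max <= threshold:  # statistics prove no row can match
            continue            # prune the whole group
        matched.extend(r for r in rows if r > threshold)
    return matched
```

On wide nested data, skipping decode work for pruned groups is exactly where the observed time difference would come from.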
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404620#comment-15404620 ] Maciej Bryński commented on SPARK-16320: I think that problem is already resolved by https://github.com/apache/spark/pull/13701
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404619#comment-15404619 ] Maciej Bryński commented on SPARK-16320: Yes. That's it. With this PR Spark 2.0 is faster than 1.6.
[jira] [Commented] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404611#comment-15404611 ] Apache Spark commented on SPARK-16802:
--------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14464

> joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
> --------------------------------------------------------------------
>
>                 Key: SPARK-16802
>                 URL: https://issues.apache.org/jira/browse/SPARK-16802
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Sylvain Zimmer
>            Assignee: Davies Liu
>            Priority: Critical
>
> Hello!
> This is a little similar to
> [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I
> have reopened it?).
> I would recommend giving another full review to {{HashedRelation.scala}},
> particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors
> that I haven't managed to reproduce so far, as well as what I suspect could
> be memory leaks (I have a query in a loop OOMing after a few iterations
> despite not caching its results).
> Here is the script to reproduce the ArrayIndexOutOfBoundsException on the
> current 2.0 branch:
> {code}
> import os
> import random
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
>
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
>     SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
>     SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
>
> def randlong():
>     return random.randint(-9223372036854775808, 9223372036854775807)
>
> while True:
>     l1, l2 = randlong(), randlong()
>     # Sample values that crash:
>     # l1, l2 = 4661454128115150227, -5543241376386463808
>     print "Testing with %s, %s" % (l1, l2)
>     data1 = [(l1, ), (l2, )]
>     data2 = [(l1, )]
>     df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
>     df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
>     crash = True
>     if crash:
>         os.system("rm -rf /tmp/sparkbug")
>         df1.write.parquet("/tmp/sparkbug/vertex")
>         df2.write.parquet("/tmp/sparkbug/edge")
>         df1 = sqlc.read.load("/tmp/sparkbug/vertex")
>         df2 = sqlc.read.load("/tmp/sparkbug/edge")
>     sqlc.registerDataFrameAsTable(df1, "df1")
>     sqlc.registerDataFrameAsTable(df2, "df2")
>     result_df = sqlc.sql("""
>     SELECT
>     df1.id1
>     FROM df1
>     LEFT OUTER JOIN df2 ON df1.id1 = df2.id2
>     """)
>     print result_df.collect()
> {code}
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 1728150825
>     at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
>     at org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
>     at
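An `ArrayIndexOutOfBoundsException` like the one above typically means an array index was derived from a hashed 64-bit key without being reduced into the backing array's range. As an illustrative sketch only (this is not Spark's actual `LongToUnsafeRowMap` logic), masking the hash with a power-of-two capacity guarantees a valid index even for extreme long keys such as the ones in the repro:

```python
# Sketch of safe slot computation for a long-keyed open-addressing map.
# With a power-of-two capacity, `hash & (capacity - 1)` always yields a
# valid slot, no matter how large or negative the 64-bit key is.

def slot_for(key, capacity):
    assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
    h = key & 0xFFFFFFFFFFFFFFFF  # treat the key as an unsigned 64-bit value
    return h & (capacity - 1)

# The two sample keys from the bug report both map into [0, capacity).
for key in (4661454128115150227, -5543241376386463808):
    s = slot_for(key, 1 << 20)
    assert 0 <= s < (1 << 20)
print("all slots in range")
```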
[jira] [Resolved] (SPARK-16787) SparkContext.addFile() should not fail if called twice with the same file
[ https://issues.apache.org/jira/browse/SPARK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-16787. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14396 [https://github.com/apache/spark/pull/14396] > SparkContext.addFile() should not fail if called twice with the same file > - > > Key: SPARK-16787 > URL: https://issues.apache.org/jira/browse/SPARK-16787 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.1, 2.1.0 > > > The behavior of SparkContext.addFile() changed slightly with the introduction > of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where > it was disabled by default) and became the default / only file server in > Spark 2.0.0. > Prior to 2.0, calling SparkContext.addFile() twice with the same path would > succeed and would cause future tasks to receive an updated copy of the file. > This behavior was never explicitly documented but Spark has behaved this way > since very early 1.x versions (some of the relevant lines in > Executor.updateDependencies() have existed since 2012). > In 2.0 (or 1.6 with the Netty file server enabled), the second addFile() call > will fail with a requirement error because NettyStreamManager tries to guard > against duplicate file registration. > I believe that this change of behavior was unintentional and propose to > remove the {{require}} check so that Spark 2.0 matches 1.x's default behavior. > This problem also affects addJar() in a more subtle way: the > fileServer.addJar() call will also fail with an exception but that exception > is logged and ignored due to some code which was added in 2014 in order to > ignore errors caused by missing Spark examples JARs when running on YARN > cluster mode (AFAIK). 
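The proposed behavior, where a second `addFile()` call with the same path succeeds and simply makes the newest copy visible, can be summarized with a small sketch. This is illustrative Python only, not Spark's `NettyStreamManager` or `Executor.updateDependencies()` code:

```python
# Sketch of an idempotent file registry: re-adding a path replaces the
# previous entry instead of failing, matching Spark 1.x's default behavior.

class FileRegistry:
    def __init__(self):
        self._files = {}

    def add_file(self, path, contents):
        # Pre-2.0 semantics: duplicate registration is allowed and simply
        # makes the newest contents visible to future tasks.
        self._files[path] = contents

    def get(self, path):
        return self._files[path]

reg = FileRegistry()
reg.add_file("config.txt", "v1")
reg.add_file("config.txt", "v2")   # would raise in 2.0.0, succeeds here
print(reg.get("config.txt"))       # v2
```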
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16838) Add PMML export for ML KMeans in PySpark
[ https://issues.apache.org/jira/browse/SPARK-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404591#comment-15404591 ] Gayathri Murali commented on SPARK-16838: - I can work on this > Add PMML export for ML KMeans in PySpark > > > Key: SPARK-16838 > URL: https://issues.apache.org/jira/browse/SPARK-16838 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk > > After we finish SPARK-11237 we should also expose PMML export in the Python > API for KMeans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16859) History Server storage information is missing
[ https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Ivanov updated SPARK-16859:
----------------------------------
    Description:

It looks like the job history storage tab in the history server has been broken for completed jobs since *1.6.2*. More specifically, it has been broken since [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].

I've fixed it for my installation by effectively reverting the above patch ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).

IMHO, the most straightforward fix would be to implement _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ and make sure it works from _ReplayListenerBus_.

The downside is that it will still work incorrectly with pre-patch job histories. But then, it hasn't worked since *1.6.2* anyhow.

PS: I'd really love to have this fixed eventually, but I'm pretty new to Apache Spark and lack hands-on Scala experience. So I'd prefer that it be fixed by someone experienced with roadmap vision. If nobody volunteers, I'll try to patch it myself.

  was:

It looks like the job history storage tab in the history server has been broken for completed jobs since *1.6.2*. More specifically, it has been broken since [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].

I've fixed it for my installation by effectively reverting the above patch ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).

IMHO, the most straightforward fix would be to implement _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ and make sure it works from _ReplayListenerBus_.

The downside is that it will still work incorrectly with pre-patch job histories. But then, it hasn't worked since *1.6.2* anyhow.

> History Server storage information is missing
> ---------------------------------------------
>
>                 Key: SPARK-16859
>                 URL: https://issues.apache.org/jira/browse/SPARK-16859
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Andrey Ivanov
>              Labels: historyserver, newbie
>
> It looks like the job history storage tab in the history server has been broken for
> completed jobs since *1.6.2*. More specifically, it has been broken since
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ and
> make sure it works from _ReplayListenerBus_.
> The downside is that it will still work incorrectly with pre-patch job
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually, but I'm pretty new to
> Apache Spark and lack hands-on Scala experience. So I'd prefer that it be
> fixed by someone experienced with roadmap vision. If nobody volunteers, I'll
> try to patch it myself.
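The fix direction described above, teaching `JsonProtocol` to serialize `SparkListenerBlockUpdated` so `ReplayListenerBus` can replay it, follows the usual event-log pattern: every listener event must survive a JSON round trip, or the history server silently loses it. Below is a hedged Python sketch of that round trip; the event names mirror the ones discussed, but the field names and the dropping-on-replay behavior are simplifications, not Spark's actual wire format:

```python
import json

# Sketch: an event only appears in the replayed history if the replayer
# knows how to parse it; unknown event types are simply skipped.

def to_json(event):
    return json.dumps(event)

def replay(lines, known_events):
    replayed = []
    for line in lines:
        event = json.loads(line)
        if event.get("Event") in known_events:   # unknown types are skipped
            replayed.append(event)
    return replayed

log = [to_json({"Event": "SparkListenerBlockUpdated", "BlockId": "rdd_0_1"}),
       to_json({"Event": "SparkListenerJobStart", "JobId": 0})]

# Without support for block updates, the storage information vanishes:
print(len(replay(log, {"SparkListenerJobStart"})))                               # 1
# Once block updates are (de)serializable, the storage tab gets its data back:
print(len(replay(log, {"SparkListenerJobStart", "SparkListenerBlockUpdated"})))  # 2
```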
[jira] [Resolved] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions
[ https://issues.apache.org/jira/browse/SPARK-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-6399. --- Resolution: Won't Fix I think at this point it's pretty clear we won't do anything here. > Code compiled against 1.3.0 may not run against older Spark versions > > > Key: SPARK-6399 > URL: https://issues.apache.org/jira/browse/SPARK-6399 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Affects Versions: 1.3.0 >Reporter: Marcelo Vanzin >Priority: Minor > > Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're > easier to use. The problem is that scalac now generates code that will not > run on older Spark versions if those conversions are used. > Basically, even if you explicitly import {{SparkContext._}}, scalac will > generate references to the new methods in the {{RDD}} object instead. So the > compiled code will reference code that doesn't exist in older versions of > Spark. > You can work around this by explicitly calling the methods in the > {{SparkContext}} object, although that's a little ugly. > We should at least document this limitation (if there's no way to fix it), > since I believe forwards compatibility in the API was also a goal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16859) History Server storage information is missing
Andrey Ivanov created SPARK-16859:
-------------------------------------
             Summary: History Server storage information is missing
                 Key: SPARK-16859
                 URL: https://issues.apache.org/jira/browse/SPARK-16859
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0, 1.6.2
            Reporter: Andrey Ivanov

It looks like the job history storage tab in the history server has been broken for completed jobs since *1.6.2*. More specifically, it has been broken since [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].

I've fixed it for my installation by effectively reverting the above patch ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).

IMHO, the most straightforward fix would be to implement _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_ and make sure it works from _ReplayListenerBus_.

The downside is that it will still work incorrectly with pre-patch job histories. But then, it hasn't worked since *1.6.2* anyhow.
[jira] [Updated] (SPARK-16855) move Greatest and Least from conditionalExpressions.scala to arithmetic.scala
[ https://issues.apache.org/jira/browse/SPARK-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16855: Fix Version/s: (was: 2.0.1) > move Greatest and Least from conditionalExpressions.scala to arithmetic.scala > - > > Key: SPARK-16855 > URL: https://issues.apache.org/jira/browse/SPARK-16855 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16855) move Greatest and Least from conditionalExpressions.scala to arithmetic.scala
[ https://issues.apache.org/jira/browse/SPARK-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16855. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > move Greatest and Least from conditionalExpressions.scala to arithmetic.scala > - > > Key: SPARK-16855 > URL: https://issues.apache.org/jira/browse/SPARK-16855 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16796) Visible passwords on Spark environment page
[ https://issues.apache.org/jira/browse/SPARK-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16796:
------------------------------
    Assignee: Artur

> Visible passwords on Spark environment page
> -------------------------------------------
>
>                 Key: SPARK-16796
>                 URL: https://issues.apache.org/jira/browse/SPARK-16796
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>            Reporter: Artur
>            Assignee: Artur
>            Priority: Trivial
>         Attachments: Mask_spark_ssl_keyPassword_spark_ssl_keyStorePassword_spark_ssl_trustStorePassword_from_We1.patch
>
> The Spark properties (passwords)
> spark.ssl.keyPassword, spark.ssl.keyStorePassword, and spark.ssl.trustStorePassword
> are visible in the Web UI on the environment page.
> Can we mask them in the Web UI?
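Masking these properties in the environment page amounts to redacting values whose keys look secret before they are rendered. An illustrative Python sketch of that idea (the pattern and function names are invented for illustration, not the Spark UI code):

```python
import re

# Sketch: redact config values whose keys suggest credentials before
# they are displayed, e.g. on an environment/status page.

SECRET_KEY_PATTERN = re.compile(r"password|secret|token", re.IGNORECASE)

def redact(conf):
    return {k: ("*****" if SECRET_KEY_PATTERN.search(k) else v)
            for k, v in conf.items()}

conf = {
    "spark.ssl.keyPassword": "hunter2",
    "spark.ssl.keyStorePassword": "hunter2",
    "spark.executor.memory": "30g",
}
print(redact(conf)["spark.ssl.keyPassword"])   # *****
print(redact(conf)["spark.executor.memory"])   # 30g
```

A key-based pattern is the usual design choice here: it redacts every value under a suspicious key rather than trying to recognize secret-looking values, so new password-style settings are covered automatically.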
[jira] [Assigned] (SPARK-16858) Removal of TestHiveSharedState
[ https://issues.apache.org/jira/browse/SPARK-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16858:
------------------------------------
    Assignee: (was: Apache Spark)

> Removal of TestHiveSharedState
> ------------------------------
>
>                 Key: SPARK-16858
>                 URL: https://issues.apache.org/jira/browse/SPARK-16858
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Remove TestHiveSharedState. Otherwise, we are not really testing the
> reflection logic based on the setting of CATALOG_IMPLEMENTATION.
[jira] [Commented] (SPARK-16858) Removal of TestHiveSharedState
[ https://issues.apache.org/jira/browse/SPARK-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404464#comment-15404464 ] Apache Spark commented on SPARK-16858:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14463

> Removal of TestHiveSharedState
> ------------------------------
>
>                 Key: SPARK-16858
>                 URL: https://issues.apache.org/jira/browse/SPARK-16858
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>
> Remove TestHiveSharedState. Otherwise, we are not really testing the
> reflection logic based on the setting of CATALOG_IMPLEMENTATION.
[jira] [Assigned] (SPARK-16858) Removal of TestHiveSharedState
[ https://issues.apache.org/jira/browse/SPARK-16858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16858:
------------------------------------
    Assignee: Apache Spark

> Removal of TestHiveSharedState
> ------------------------------
>
>                 Key: SPARK-16858
>                 URL: https://issues.apache.org/jira/browse/SPARK-16858
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiao Li
>            Assignee: Apache Spark
>
> Remove TestHiveSharedState. Otherwise, we are not really testing the
> reflection logic based on the setting of CATALOG_IMPLEMENTATION.
[jira] [Created] (SPARK-16858) Removal of TestHiveSharedState
Xiao Li created SPARK-16858:
-------------------------------
             Summary: Removal of TestHiveSharedState
                 Key: SPARK-16858
                 URL: https://issues.apache.org/jira/browse/SPARK-16858
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Xiao Li

Remove TestHiveSharedState. Otherwise, we are not really testing the reflection logic based on the setting of CATALOG_IMPLEMENTATION.
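The point about `CATALOG_IMPLEMENTATION` is that the shared state class is chosen from a configuration value at runtime, and a hard-coded test subclass bypasses exactly that selection path. A minimal Python sketch of config-driven implementation selection (class and key names are invented for illustration; Spark does this via JVM reflection):

```python
# Sketch: pick an implementation class by name from configuration --
# the selection path that a hard-coded test subclass would bypass.

class InMemoryCatalog:
    name = "in-memory"

class HiveCatalog:
    name = "hive"

_IMPLEMENTATIONS = {"in-memory": InMemoryCatalog, "hive": HiveCatalog}

def load_catalog(conf):
    impl = conf.get("catalog.implementation", "in-memory")
    cls = _IMPLEMENTATIONS[impl]   # real code resolves the class reflectively
    return cls()

print(load_catalog({"catalog.implementation": "hive"}).name)  # hive
print(load_catalog({}).name)                                  # in-memory
```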
[jira] [Updated] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15639:
-------------------------------
    Target Version/s: 2.0.1

> Try to push down filter at RowGroups level for parquet reader
> -------------------------------------------------------------
>
>                 Key: SPARK-15639
>                 URL: https://issues.apache.org/jira/browse/SPARK-15639
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Blocker
>
> When we use the vectorized parquet reader, although the base reader (i.e.,
> SpecificParquetRecordReaderBase) retrieves the pushed-down filters for
> RowGroups-level filtering, it seems we do not really set up the filters to be
> pushed down.
[jira] [Updated] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15639:
-------------------------------
    Priority: Blocker (was: Major)

> Try to push down filter at RowGroups level for parquet reader
> -------------------------------------------------------------
>
>                 Key: SPARK-15639
>                 URL: https://issues.apache.org/jira/browse/SPARK-15639
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Blocker
>
> When we use the vectorized parquet reader, although the base reader (i.e.,
> SpecificParquetRecordReaderBase) retrieves the pushed-down filters for
> RowGroups-level filtering, it seems we do not really set up the filters to be
> pushed down.
[jira] [Updated] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16816:
------------------------------
    Assignee: sandeep purohit

> Add documentation to create JavaSparkContext from SparkSession
> --------------------------------------------------------------
>
>                 Key: SPARK-16816
>                 URL: https://issues.apache.org/jira/browse/SPARK-16816
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.0.0
>            Reporter: sandeep purohit
>            Assignee: sandeep purohit
>            Priority: Trivial
>             Fix For: 2.1.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> This issue documents how to create a JavaSparkContext from a SparkSession.
[jira] [Resolved] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16816.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 14436 [https://github.com/apache/spark/pull/14436]

> Add documentation to create JavaSparkContext from SparkSession
> --------------------------------------------------------------
>
>                 Key: SPARK-16816
>                 URL: https://issues.apache.org/jira/browse/SPARK-16816
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.0.0
>            Reporter: sandeep purohit
>            Priority: Trivial
>             Fix For: 2.1.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> This issue documents how to create a JavaSparkContext from a SparkSession.
[jira] [Updated] (SPARK-16850) Improve error message for greatest/least
[ https://issues.apache.org/jira/browse/SPARK-16850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16850: Fix Version/s: 2.0.1 > Improve error message for greatest/least > > > Key: SPARK-16850 > URL: https://issues.apache.org/jira/browse/SPARK-16850 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Peter Lee >Assignee: Peter Lee > Fix For: 2.0.1, 2.1.0 > > > Greatest/least function does not have the most friendly error message for > data types: > Error in SQL statement: AnalysisException: cannot resolve 'greatest(CAST(1.0 > AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should > all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), > StringType)).; line 1 pos 7 > We should report the human readable data type instead, rather than having > "ArrayBuffer" and "StringType". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
Ryan Claussen created SPARK-16857: - Summary: CrossValidator and KMeans throw IllegalArgumentException Key: SPARK-16857 URL: https://issues.apache.org/jira/browse/SPARK-16857 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.6.1 Environment: spark-jobserver docker image. Spark 1.6.1 on Ubuntu, Hadoop 2.4 Reporter: Ryan Claussen I am attempting to use CrossValidator to train a KMeans model. When I attempt to fit the data, Spark throws an IllegalArgumentException (below), since the KMeans algorithm outputs an Integer into the prediction column instead of a Double. Before I go too far: is using CrossValidator with KMeans supported? Here's the exception: {quote} java.lang.IllegalArgumentException: requirement failed: Column prediction must be of type DoubleType but was actually IntegerType. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109) at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99) at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202) at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62) at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39) at spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {quote} Here is the code I'm using to set up my cross validator. As the stack trace above indicates, it is failing at the fit step, when {{cv.fit(trainingData)}} is called: {quote} ... val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures") val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels) val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, mpc, labelConverter)) val evaluator = new MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction") val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 200, 500)).build() val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3) val cvModel = cv.fit(trainingData) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
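The failing requirement comes from the evaluator rather than from KMeans itself: MulticlassClassificationEvaluator checks that the prediction column is DoubleType, while KMeans writes integer cluster ids. The pure-Python sketch below is a simplified stand-in for Spark's schema check (it is not the Spark code) that illustrates the mismatch and the usual workaround of casting the column to double before evaluation:

```python
# Simplified stand-in for Spark's SchemaUtils.checkColumnType (assumption:
# schemas modeled as plain dicts of column name -> type name).
def check_column_type(schema, col_name, required_type):
    """Raise the same kind of requirement failure seen in the stack trace."""
    actual = schema[col_name]
    if actual != required_type:
        raise ValueError(
            f"requirement failed: Column {col_name} must be of type "
            f"{required_type} but was actually {actual}."
        )

# KMeans-style output: the prediction column holds integer cluster ids.
kmeans_schema = {"features": "VectorType", "prediction": "IntegerType"}

try:
    check_column_type(kmeans_schema, "prediction", "DoubleType")
except ValueError as e:
    print(e)  # mirrors the IllegalArgumentException in the report

# Workaround: cast the prediction column to double before evaluating.
casted_schema = dict(kmeans_schema, prediction="DoubleType")
check_column_type(casted_schema, "prediction", "DoubleType")  # passes
```

With the real API, the analogous fix would be a cast such as `df.withColumn("prediction", col("prediction").cast("double"))` applied between clustering and evaluation; whether CrossValidator is intended to support KMeans at all is the separate question the reporter raises.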
[jira] [Resolved] (SPARK-16836) Hive date/time function error
[ https://issues.apache.org/jira/browse/SPARK-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16836. - Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.1.0 2.0.1 > Hive date/time function error > - > > Key: SPARK-16836 > URL: https://issues.apache.org/jira/browse/SPARK-16836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Jesse Lord >Assignee: Herman van Hovell >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > Previously available hive functions for date/time are not available in Spark > 2.0 (e.g. current_date, current_timestamp). These functions work in Spark > 1.6.2 with HiveContext. > Example (from spark-shell): > {noformat} > scala> spark.sql("select current_date") > org.apache.spark.sql.AnalysisException: cannot resolve '`current_date`' given > input columns: []; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) > ... 48 elided > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16062) PySpark SQL python-only UDTs don't work well
[ https://issues.apache.org/jira/browse/SPARK-16062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16062. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13778 [https://github.com/apache/spark/pull/13778] > PySpark SQL python-only UDTs don't work well > > > Key: SPARK-16062 > URL: https://issues.apache.org/jira/browse/SPARK-16062 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.1, 2.1.0 > > > Python-only UDTs can't work well. One example is: > {code} > import pyspark.sql.group > from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT > from pyspark.sql.types import * > schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT()) > df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for > i in range(10)], schema=schema) > df.collect() # this works > df.show() # this doesn't work > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15989) PySpark SQL python-only UDTs don't support nested types
[ https://issues.apache.org/jira/browse/SPARK-15989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15989. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13778 [https://github.com/apache/spark/pull/13778] > PySpark SQL python-only UDTs don't support nested types > --- > > Key: SPARK-15989 > URL: https://issues.apache.org/jira/browse/SPARK-15989 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Vladimir Feinberg > Fix For: 2.0.1, 2.1.0 > > > [This > notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/611202526513296/1653464426712019/latest.html] > demonstrates the bug. > The obvious issue is that nested UDTs are not supported if the UDT is > Python-only. Looking at the exception thrown, this seems to be because the > encoder on the Java end tries to encode the UDT as a Java class, which > doesn't exist for the [[PythonOnlyUDT]]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404369#comment-15404369 ] Yin Huai commented on SPARK-16320: -- Can you also try https://github.com/apache/spark/pull/13701 and see if it makes any difference? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404369#comment-15404369 ] Yin Huai edited comment on SPARK-16320 at 8/2/16 4:52 PM: -- [~maver1ck] Can you also try https://github.com/apache/spark/pull/13701 and see if it makes any difference? was (Author: yhuai): Can you also try https://github.com/apache/spark/pull/13701 and see if it makes any difference? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404361#comment-15404361 ] Michael Allman commented on SPARK-16320: [~maver1ck] I'm having trouble reproducing your problem. I would like to help, but I need a dataset to work with. Can you please post a test dataset (to S3) and query in which you see the regression? > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some test on parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar. (about 1 sec) > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance ? > I don't know how to prepare sample data to show the problem. > Any ideas ? Or public data with many nested columns ? > *UPDATE* > I created script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 direct to same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16856) Link application summary page and detail page to the master page
[ https://issues.apache.org/jira/browse/SPARK-16856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404331#comment-15404331 ] Apache Spark commented on SPARK-16856: -- User 'nblintao' has created a pull request for this issue: https://github.com/apache/spark/pull/14461 > Link application summary page and detail page to the master page > > > Key: SPARK-16856 > URL: https://issues.apache.org/jira/browse/SPARK-16856 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Reporter: Tao Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16856) Link application summary page and detail page to the master page
[ https://issues.apache.org/jira/browse/SPARK-16856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16856: Assignee: (was: Apache Spark) > Link application summary page and detail page to the master page > > > Key: SPARK-16856 > URL: https://issues.apache.org/jira/browse/SPARK-16856 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Reporter: Tao Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16856) Link application summary page and detail page to the master page
[ https://issues.apache.org/jira/browse/SPARK-16856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16856: Assignee: Apache Spark > Link application summary page and detail page to the master page > > > Key: SPARK-16856 > URL: https://issues.apache.org/jira/browse/SPARK-16856 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Reporter: Tao Lin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16835) LinearRegression, LogisticRegression, and AFTSurvivalRegression should unpersist input training data when an exception is thrown
[ https://issues.apache.org/jira/browse/SPARK-16835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16835. --- Resolution: Won't Fix > LinearRegression, LogisticRegression, and AFTSurvivalRegression should unpersist > input training data when an exception is thrown > --- > > Key: SPARK-16835 > URL: https://issues.apache.org/jira/browse/SPARK-16835 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1, 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, the `fit` methods of LinearRegression, LogisticRegression, and > AFTSurvivalRegression persist the `dataframe` passed in if it is not already > persisted, but if an exception is thrown, the dataframe is never unpersisted. > As `fit` is a public interface, it should unpersist the input in all cases > whenever it calls `persist` internally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
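The discipline requested above can be sketched without Spark: wrap the training body in try/finally so that a dataset persisted inside `fit` is released even when training throws. The dataset class below is a hypothetical stand-in, not Spark's API:

```python
# Toy dataset with persist/unpersist bookkeeping (hypothetical stand-in for
# a Spark DataFrame, used only to demonstrate the try/finally shape).
class FakeDataset:
    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self


def fit(dataset, fail=False):
    # Only manage caching that fit() itself introduced.
    handle_persistence = not dataset.persisted
    if handle_persistence:
        dataset.persist()
    try:
        if fail:
            raise RuntimeError("training blew up")
        return "model"
    finally:
        # Runs on success AND failure, so the cache never leaks.
        if handle_persistence:
            dataset.unpersist()


ds = FakeDataset()
try:
    fit(ds, fail=True)
except RuntimeError:
    pass
print(ds.persisted)  # → False: unpersisted despite the exception
```

The issue was resolved as Won't Fix, but the pattern is the standard way to make a caching `fit` exception-safe.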
[jira] [Updated] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors
[ https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16837: -- Assignee: Tom Magrino > TimeWindow incorrectly drops slideDuration in constructors > -- > > Key: SPARK-16837 > URL: https://issues.apache.org/jira/browse/SPARK-16837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tom Magrino >Assignee: Tom Magrino > Fix For: 2.0.1, 2.1.0 > > > Right now, the constructors for the TimeWindow expression in Catalyst > incorrectly uses the windowDuration in place of the slideDuration. This will > cause incorrect windowing semantics after time window expressions are > analyzed by Catalyst. > Relevant code is here: > https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16837) TimeWindow incorrectly drops slideDuration in constructors
[ https://issues.apache.org/jira/browse/SPARK-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16837. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14441 [https://github.com/apache/spark/pull/14441] > TimeWindow incorrectly drops slideDuration in constructors > -- > > Key: SPARK-16837 > URL: https://issues.apache.org/jira/browse/SPARK-16837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Tom Magrino > Fix For: 2.0.1, 2.1.0 > > > Right now, the constructors for the TimeWindow expression in Catalyst > incorrectly use the windowDuration in place of the slideDuration. This will > cause incorrect windowing semantics after time window expressions are > analyzed by Catalyst. > Relevant code is here: > https://github.com/apache/spark/blob/branch-2.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala#L29-L54 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
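The semantic consequence of the bug can be sketched in plain Python (an illustration of sliding-window assignment, not the Catalyst code): substituting windowDuration for slideDuration silently turns every sliding window into a tumbling one, so events that should land in several overlapping windows land in exactly one.

```python
def windows_for(event_time, window_duration, slide_duration):
    """Return the sorted [start, end) windows that contain event_time,
    assuming windows start at multiples of slide_duration."""
    assignments = []
    # Latest window start at or before the event:
    start = (event_time // slide_duration) * slide_duration
    # Walk back while the window still covers the event.
    while start > event_time - window_duration:
        assignments.append((start, start + window_duration))
        start -= slide_duration
    return sorted(assignments)

# Correct semantics: a 10-unit window sliding every 5 units — the event at
# t=12 belongs to two overlapping windows.
print(windows_for(12, window_duration=10, slide_duration=5))
# → [(5, 15), (10, 20)]

# Buggy constructor behaviour (slide silently set to the window duration):
# the same event belongs to only one, tumbling window.
print(windows_for(12, window_duration=10, slide_duration=10))
# → [(10, 20)]
```

This is why the bug produced "incorrect windowing semantics" rather than an error: the query still runs, just with tumbling windows.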
[jira] [Updated] (SPARK-16822) Support latex in scaladoc with MathJax
[ https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16822: -- Assignee: Shuai Lin > Support latex in scaladoc with MathJax > -- > > Key: SPARK-16822 > URL: https://issues.apache.org/jira/browse/SPARK-16822 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Shuai Lin >Assignee: Shuai Lin >Priority: Minor > Fix For: 2.1.0 > > > The scaladoc of some classes (mainly ml/mllib classes) include math formulas, > but currently it renders very ugly, e.g. [the doc of the LogisticGradient > class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient]. > We can improve this by including MathJax javascripts in the scaladocs page, > much like what we do for the markdown docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16822) Support latex in scaladoc with MathJax
[ https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16822. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14438 [https://github.com/apache/spark/pull/14438] > Support latex in scaladoc with MathJax > -- > > Key: SPARK-16822 > URL: https://issues.apache.org/jira/browse/SPARK-16822 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Shuai Lin >Priority: Minor > Fix For: 2.1.0 > > > The scaladoc of some classes (mainly ml/mllib classes) include math formulas, > but currently it renders very ugly, e.g. [the doc of the LogisticGradient > class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient]. > We can improve this by including MathJax javascripts in the scaladocs page, > much like what we do for the markdown docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16856) Link application summary page and detail page to the master page
Tao Lin created SPARK-16856: --- Summary: Link application summary page and detail page to the master page Key: SPARK-16856 URL: https://issues.apache.org/jira/browse/SPARK-16856 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Tao Lin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16520) Link executors to corresponding worker pages
[ https://issues.apache.org/jira/browse/SPARK-16520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Lin updated SPARK-16520: Priority: Major (was: Minor) > Link executors to corresponding worker pages > > > Key: SPARK-16520 > URL: https://issues.apache.org/jira/browse/SPARK-16520 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Tao Lin > > In the executor page, we add links from executors to their corresponding > worker pages. Basically, we add links to the address or executor IDs linking > to their corresponding worker pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15541) SparkContext.stop throws error
[ https://issues.apache.org/jira/browse/SPARK-15541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15541. --- Resolution: Resolved Assignee: Maciej Bryński Resolved by https://github.com/apache/spark/pull/14459 > SparkContext.stop throws error > -- > > Key: SPARK-15541 > URL: https://issues.apache.org/jira/browse/SPARK-15541 > Project: Spark > Issue Type: Bug >Reporter: Miao Wang >Assignee: Maciej Bryński > > When running unit-tests or examples from command line or Intellij, > SparkContext throws errors. > For example: > ./bin/run-example ml.JavaNaiveBayesExample > Exception in thread "main" 16/05/25 15:17:55 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > java.lang.NoSuchMethodError: > java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView; > at org.apache.spark.rpc.netty.Dispatcher.stop(Dispatcher.scala:176) > at org.apache.spark.rpc.netty.NettyRpcEnv.cleanup(NettyRpcEnv.scala:291) > at > org.apache.spark.rpc.netty.NettyRpcEnv.shutdown(NettyRpcEnv.scala:269) > at org.apache.spark.SparkEnv.stop(SparkEnv.scala:91) > at > org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1796) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1795) > at org.apache.spark.sql.SparkSession.stop(SparkSession.scala:577) > at > org.apache.spark.examples.ml.JavaNaiveBayesExample.main(JavaNaiveBayesExample.java:61) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:724) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 16/05/25 15:17:55 INFO ShutdownHookManager: Shutdown hook called -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16854) mapWithState Support for Python
[ https://issues.apache.org/jira/browse/SPARK-16854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404217#comment-15404217 ] Boaz commented on SPARK-16854: -- IMHO, streaming in Python would be incomplete without mapWithState. I find it hard to use updateStateByKey with large states. > mapWithState Support for Python > --- > > Key: SPARK-16854 > URL: https://issues.apache.org/jira/browse/SPARK-16854 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 >Reporter: Boaz > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
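The scaling difference the comment alludes to can be sketched in plain Python (a toy model of the two update styles, not the Spark Streaming API): updateStateByKey revisits every tracked key on each batch interval, while mapWithState only visits keys that actually appear in the batch.

```python
def update_state_by_key_batch(state, batch):
    """Full pass over all tracked keys, as updateStateByKey does per batch.
    Returns the number of keys visited."""
    touched = 0
    new_values = {}
    for key, value in batch:
        new_values.setdefault(key, []).append(value)
    # Every existing key is revisited, even with no new data for it.
    for key in set(state) | set(new_values):
        state[key] = state.get(key, 0) + sum(new_values.get(key, []))
        touched += 1
    return touched

def map_with_state_batch(state, batch):
    """Incremental update: only the batch's keys are visited."""
    touched = 0
    for key, value in batch:
        state[key] = state.get(key, 0) + value
        touched += 1
    return touched

state = {f"user{i}": 1 for i in range(100_000)}  # large existing state
batch = [("user0", 5), ("user1", 7)]             # tiny new batch

print(update_state_by_key_batch(dict(state), batch))  # → 100000 keys touched
print(map_with_state_batch(dict(state), batch))       # → 2 keys touched
```

This per-batch full scan is why updateStateByKey becomes painful once the tracked state grows large, and why a mapWithState equivalent is being requested for PySpark.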
[jira] [Updated] (SPARK-12650) No means to specify Xmx settings for spark-submit in cluster deploy mode for Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-12650: Summary: No means to specify Xmx settings for spark-submit in cluster deploy mode for Spark on YARN (was: No means to specify Xmx settings for SparkSubmit in cluster deploy mode for Spark on YARN) > No means to specify Xmx settings for spark-submit in cluster deploy mode for > Spark on YARN > -- > > Key: SPARK-12650 > URL: https://issues.apache.org/jira/browse/SPARK-12650 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.2 > Environment: Hadoop 2.6.0 >Reporter: John Vines > > Background- > I have an app master designed to do some work and then launch a Spark job. > Issue- > If I use yarn-cluster, SparkSubmit does not set -Xmx on itself at all, so > the JVM takes a relatively large default heap. This causes a large amount of > vmem to be used, and the process is killed by YARN. This can be worked > around by disabling YARN's vmem check, but that is a hack. > If I run in yarn-client mode, it's fine as long as my container has enough > space for the driver, which is manageable. But I feel that the utter lack of > Xmx settings for what I believe is a very small JVM is a problem. > I believe this was introduced with the fix for SPARK-3884 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
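One workaround sketch, under the assumption that the launcher JVM is the problem rather than the driver: `SPARK_SUBMIT_OPTS` is read by `bin/spark-class` and appended to the flags of the JVM that runs SparkSubmit, so it can cap that heap independently of `--driver-memory` (which, in cluster deploy mode, sizes the driver running inside the YARN application master). The `-Xmx` value below is illustrative only.

```shell
# Cap the heap of the spark-submit launcher JVM itself (distinct from
# --driver-memory). SPARK_SUBMIT_OPTS is picked up by bin/spark-class.
export SPARK_SUBMIT_OPTS="-Xmx256m"

# Then launch as usual (shown as a comment; requires a Spark installation):
# spark-submit --master yarn --deploy-mode cluster --class com.example.App app.jar

echo "$SPARK_SUBMIT_OPTS"
```

This only bounds the client-side process; it does not address the reporter's larger point that SparkSubmit ships with no default -Xmx at all.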
[jira] [Updated] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in cluster deploy mode for Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-12650: Summary: No means to specify Xmx settings for SparkSubmit in cluster deploy mode for Spark on YARN (was: No means to specify Xmx settings for SparkSubmit in yarn-cluster mode) > No means to specify Xmx settings for SparkSubmit in cluster deploy mode for > Spark on YARN > - > > Key: SPARK-12650 > URL: https://issues.apache.org/jira/browse/SPARK-12650 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.2 > Environment: Hadoop 2.6.0 >Reporter: John Vines > > Background- > I have an app master designed to do some work and then launch a Spark job. > Issue- > If I use yarn-cluster, SparkSubmit does not set -Xmx on itself at all, so the JVM takes a relatively large default heap. This causes a large amount of vmem to be consumed, and the container is killed by YARN. This can be worked around by disabling YARN's vmem check, but that is a hack. > If I run in yarn-client mode, it is fine as long as my container has enough space for the driver, which is manageable. But I feel that the utter lack of Xmx settings for what I believe is a very small JVM is a problem. > I believe this was introduced with the fix for SPARK-3884
[jira] [Commented] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404175#comment-15404175 ] Jacek Laskowski commented on SPARK-14453: - Is anyone working on it? I just found a few places in Spark on YARN that should have been removed long ago (they only introduce noise in the code). I'd like to work on this if possible. > Remove SPARK_JAVA_OPTS environment variable > --- > > Key: SPARK-14453 > URL: https://issues.apache.org/jira/browse/SPARK-14453 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao >Priority: Minor > > SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of a major version (2.0), I think it would be better to remove support for this env variable.
[jira] [Updated] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-14453: Issue Type: Task (was: Bug) > Remove SPARK_JAVA_OPTS environment variable > --- > > Key: SPARK-14453 > URL: https://issues.apache.org/jira/browse/SPARK-14453 > Project: Spark > Issue Type: Task > Components: Spark Core, YARN >Reporter: Saisai Shao >Priority: Minor > > SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of a major version (2.0), I think it would be better to remove support for this env variable.
[jira] [Updated] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-14453: Summary: Remove SPARK_JAVA_OPTS environment variable (was: Consider removing SPARK_JAVA_OPTS env variable) > Remove SPARK_JAVA_OPTS environment variable > --- > > Key: SPARK-14453 > URL: https://issues.apache.org/jira/browse/SPARK-14453 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao >Priority: Minor > > SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of a major version (2.0), I think it would be better to remove support for this env variable.
[jira] [Commented] (SPARK-16852) RejectedExecutionException when exit at some times
[ https://issues.apache.org/jira/browse/SPARK-16852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404166#comment-15404166 ] Weizhong commented on SPARK-16852: -- I ran a 2 TB TPC-DS workload, and sometimes it prints the stack trace below. {noformat} cases="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99" i=1 total=99 for t in ${cases} do echo "run sql99 sample query $t ($i/$total)." spark-sql --master yarn-client --driver-memory 30g --executor-memory 30g --num-executors 6 --executor-cores 5 -f query/query${t}.sql > logs/${t}.result 2> logs/${t}.log i=`expr $i + 1` done {noformat} > RejectedExecutionException when exit at some times > -- > > Key: SPARK-16852 > URL: https://issues.apache.org/jira/browse/SPARK-16852 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Weizhong >Priority: Minor > > If we run a huge job, sometimes on exit it will print a > RejectedExecutionException > {noformat} > 16/05/27 08:30:40 ERROR client.TransportResponseHandler: Still have 3 > requests outstanding when connection from HGH117808/10.184.66.104:41980 > is closed > java.util.concurrent.RejectedExecutionException: Task > scala.concurrent.impl.CallbackRunnable@6b66dba rejected from > java.util.concurrent.ThreadPoolExecutor@60725736[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 269] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) > at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643) > at > scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658) > at > scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) > at > scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) > at > scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634) > at > scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at > 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.tryFailure(Promise.scala:115) > at > scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153) > at > org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:192) > at > org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$1.apply(NettyRpcEnv.scala:214) > at >
[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns
[ https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404114#comment-15404114 ] Maciej Bryński edited comment on SPARK-16320 at 8/2/16 3:05 PM: [~clockfly] I tested your patch. Results are equal to Spark 2.0 without the patch. But I have one more observation: for every stage there is an "Input" column in the Spark UI. For Spark 2.0 it shows up to 11 GB; for Spark 1.6 only 2.8 GB. Maybe this is connected? I attached two screenshots from the UI. PS. I also tested the spark.sql.codegen.maxFields parameter to force Spark to use whole-stage code generation, but it makes almost no difference. was (Author: maver1ck): [~clockfly] I tested your patch. Results are equal to Spark 2.0 without the patch. But I have one more observation. For every Stage there is "Input" column in Spark-UI. For Spark 2.0 it shows up to 11 GB. For Spark 1.6 only 2.8 GB. Maybe this is connected ? I attached two screenshots from UI. > Spark 2.0 slower than 1.6 when querying nested columns > -- > > Key: SPARK-16320 > URL: https://issues.apache.org/jira/browse/SPARK-16320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > Attachments: spark1.6-ui.png, spark2-ui.png > > > I did some tests on a parquet file with many nested columns (about 30G in > 400 partitions) and Spark 2.0 is sometimes slower. > I tested the following queries: > 1) {code}select count(*) where id > some_id{code} > In this query performance is similar (about 1 sec). > 2) {code}select count(*) where nested_column.id > some_id{code} > Spark 1.6 -> 1.6 min > Spark 2.0 -> 2.1 min > Should I expect such a drop in performance? > I don't know how to prepare sample data to show the problem. > Any ideas? Or public data with many nested columns? > *UPDATE* > I created a script to generate data and to confirm this problem. 
> {code} > #Initialization > from pyspark import SparkContext, SparkConf > from pyspark.sql import HiveContext > from pyspark.sql.functions import struct > conf = SparkConf() > conf.set('spark.cores.max', 15) > conf.set('spark.executor.memory', '30g') > conf.set('spark.driver.memory', '30g') > sc = SparkContext(conf=conf) > sqlctx = HiveContext(sc) > #Data creation > MAX_SIZE = 2**32 - 1 > path = '/mnt/mfs/parquet_nested' > def create_sample_data(levels, rows, path): > > def _create_column_data(cols): > import random > random.seed() > return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in > range(cols)} > > def _create_sample_df(cols, rows): > rdd = sc.parallelize(range(rows)) > data = rdd.map(lambda r: _create_column_data(cols)) > df = sqlctx.createDataFrame(data) > return df > > def _create_nested_data(levels, rows): > if len(levels) == 1: > return _create_sample_df(levels[0], rows).cache() > else: > df = _create_nested_data(levels[1:], rows) > return df.select([struct(df.columns).alias("column{}".format(i)) > for i in range(levels[0])]) > df = _create_nested_data(levels, rows) > df.write.mode('overwrite').parquet(path) > > #Sample data > create_sample_data([2,10,200], 100, path) > #Query > df = sqlctx.read.parquet(path) > %%timeit > df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count() > {code} > Results > Spark 1.6 > 1 loop, best of 3: *1min 5s* per loop > Spark 2.0 > 1 loop, best of 3: *1min 21s* per loop > *UPDATE 2* > Analysis in https://issues.apache.org/jira/browse/SPARK-16321 points to the same > source. > I attached some VisualVM profiles there. > Most interesting are from queries. > https://issues.apache.org/jira/secure/attachment/12818785/spark16_query.nps > https://issues.apache.org/jira/secure/attachment/12818784/spark2_query.nps
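As a rough aid to reading the reproduction script: the nested schema produced by create_sample_data([2,10,200], ...) is very wide, which is relevant to the spark.sql.codegen.maxFields experiment mentioned in the edited comment. A small pure-Python sketch of that fan-out (the helper name is hypothetical, not part of the reporter's script):

```python
# Hypothetical helper, not from the script above: count the leaf columns in
# the nested schema described by `levels`, where each entry is how many
# structs (or, at the last level, plain columns) are created per parent.
def leaf_field_count(levels):
    total = 1
    for width in levels:
        total *= width  # each node at this level fans out `width` children
    return total

# The sample data uses levels [2, 10, 200]: 2 top-level structs, each with
# 10 child structs, each holding 200 leaf columns.
wide = leaf_field_count([2, 10, 200])
# wide == 4000 leaf fields
```

With 4000 leaf fields in the schema, a codegen field limit of a few hundred is easily exceeded, which may be why forcing whole-stage code generation via spark.sql.codegen.maxFields made almost no difference in the reported test.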