[jira] [Created] (SPARK-17883) Possible typo in comments of Row.scala
Weiluo Ren created SPARK-17883: -- Summary: Possible typo in comments of Row.scala Key: SPARK-17883 URL: https://issues.apache.org/jira/browse/SPARK-17883 Project: Spark Issue Type: Documentation Components: SQL Reporter: Weiluo Ren Priority: Trivial The description of the private method {code}private def getAnyValAs[T <: AnyVal](i: Int): T{code} says "Returns the value of a given fieldName." on https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala#L465 It should be "Returns the value at position i." instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17108) BIGINT and INT comparison failure in spark sql
[ https://issues.apache.org/jira/browse/SPARK-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17108: Assignee: Apache Spark > BIGINT and INT comparison failure in spark sql > -- > > Key: SPARK-17108 > URL: https://issues.apache.org/jira/browse/SPARK-17108 > Project: Spark > Issue Type: Bug >Reporter: Sai Krishna Kishore Beathanabhotla >Assignee: Apache Spark > > I have a Hive table with the following definition: > {noformat} > create table testforerror ( > my_column MAP<BIGINT, ARRAY<STRING>> > ); > {noformat} > The table has the following records: > {noformat} > hive> select * from testforerror; > OK > {11001:["0034111000a4WaAAA2"]} > {11001:["0034111000orWiWAAU"]} > {11001:["","0034111000VgrHdAAJ"]} > {11001:["003411cS4rDAAS"]} > {12001:["0037110001a7ofsAAA"]} > Time taken: 0.067 seconds, Fetched: 5 row(s) > {noformat} > I have a query that filters records by a key of my_column. The query is as > follows: > {noformat} > select * from testforerror where my_column[11001] is not null; > {noformat} > This query executes fine from the hive/beeline shell and produces the > following records: > {noformat} > hive> select * from testforerror where my_column[11001] is not null; > OK > {11001:["0034111000a4WaAAA2"]} > {11001:["0034111000orWiWAAU"]} > {11001:["","0034111000VgrHdAAJ"]} > {11001:["003411cS4rDAAS"]} > Time taken: 2.224 seconds, Fetched: 4 row(s) > {noformat} > However, I get an error when trying to execute it from the Spark sqlContext. 
The > following is the error message: > {noformat} > scala> val errorquery = "select * from testforerror where my_column[11001] is > not null" > errorquery: String = select * from testforerror where my_column[11001] is not > null > scala> sqlContext.sql(errorquery).show() > org.apache.spark.sql.AnalysisException: cannot resolve 'my_column[11001]' due > to data type mismatch: argument 2 requires bigint type, however, '11001' is > of int type.; line 1 pos 43 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > {noformat}
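A workaround consistent with the analyzer error above (a hedged sketch, assuming the same table and sqlContext as in the report; this is not a confirmed fix from the issue) is to cast the literal key so it matches the map's BIGINT key type:

```scala
// Hypothetical workaround: make the int literal match the BIGINT map key type
// so the analyzer's type check passes.
val query = "select * from testforerror where my_column[cast(11001 as bigint)] is not null"
sqlContext.sql(query).show()
```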
[jira] [Commented] (SPARK-17108) BIGINT and INT comparison failure in spark sql
[ https://issues.apache.org/jira/browse/SPARK-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567765#comment-15567765 ] Apache Spark commented on SPARK-17108: -- User 'weiqingy' has created a pull request for this issue: https://github.com/apache/spark/pull/15448 > BIGINT and INT comparison failure in spark sql > -- > > Key: SPARK-17108 > URL: https://issues.apache.org/jira/browse/SPARK-17108 > Project: Spark > Issue Type: Bug >Reporter: Sai Krishna Kishore Beathanabhotla
[jira] [Assigned] (SPARK-17108) BIGINT and INT comparison failure in spark sql
[ https://issues.apache.org/jira/browse/SPARK-17108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17108: Assignee: (was: Apache Spark) > BIGINT and INT comparison failure in spark sql > -- > > Key: SPARK-17108 > URL: https://issues.apache.org/jira/browse/SPARK-17108 > Project: Spark > Issue Type: Bug >Reporter: Sai Krishna Kishore Beathanabhotla
[jira] [Commented] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:
[ https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567714#comment-15567714 ] Apache Spark commented on SPARK-14804: -- User 'apivovarov' has created a pull request for this issue: https://github.com/apache/spark/pull/15447 > Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: > --- > > Key: SPARK-14804 > URL: https://issues.apache.org/jira/browse/SPARK-14804 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.1 >Reporter: SuYan >Priority: Minor > > {code} > graph3.vertices.checkpoint() > graph3.vertices.count() > graph3.vertices.map(_._2).count() > {code} > 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 > (TID 13, localhost): java.lang.ClassCastException: > org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to > scala.Tuple2 > at > com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:91) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > look at the code: > {code} > private[spark] def computeOrReadCheckpoint(split: Partition, context: > TaskContext): Iterator[T] = > { > if 
(isCheckpointedAndMaterialized) { > firstParent[T].iterator(split, context) > } else { > compute(split, context) > } > } > private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed > override def isCheckpointed: Boolean = { > firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed > } > {code} > For VertexRDD or EdgeRDD, the first parent is its partitionRDD > (RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])]). > 1. When we call vertexRDD.checkpoint, its partitionRDD is checkpointed, so > VertexRDD.isCheckpointedAndMaterialized=true. > 2. When we then call vertexRDD.iterator, because isCheckpointed=true it calls > firstParent.iterator (which is not a CheckpointRDD, but actually the partitionRDD). > > So the returned iterator is an Iterator[ShippableVertexPartition], not the expected > Iterator[(VertexId, VD)].
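The delegation bug described above can be sketched in plain Python (illustrative only; the class and method names below are hypothetical stand-ins, not Spark's actual code): once the wrapper reports itself checkpointed based on its parent, it serves the parent's elements (whole partitions) instead of its own (VertexId, VD) tuples.

```python
# Plain-Python sketch of the GraphX delegation bug (hypothetical names,
# not Spark code): the wrapper's checkpoint status delegates to its parent,
# so after checkpointing it returns the parent's iterator, whose elements
# have the wrong type.

class ParentRDD:
    """Stands in for the partitionRDD of ShippableVertexPartition objects."""
    def __init__(self):
        self.checkpointed = False

    def iterator(self):
        yield "ShippableVertexPartition"  # partition objects, not tuples

class VertexRDDSketch:
    def __init__(self, parent):
        self.parent = parent

    @property
    def is_checkpointed(self):
        # Delegates to the first parent, as the quoted code does.
        return self.parent.checkpointed

    def compute(self):
        yield (1, "vertex-attr")  # the expected (VertexId, VD) elements

    def iterator(self):
        # Mirrors computeOrReadCheckpoint: once "checkpointed", serve the
        # first parent's iterator instead of computing our own elements.
        if self.is_checkpointed:
            return self.parent.iterator()
        return self.compute()

parent = ParentRDD()
rdd = VertexRDDSketch(parent)
before = next(rdd.iterator())   # (1, 'vertex-attr')
parent.checkpointed = True
after = next(rdd.iterator())    # 'ShippableVertexPartition' -- the cast failure
print(before, after)
```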
[jira] [Resolved] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.
[ https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17880. - Resolution: Fixed Assignee: Kousuke Saruta Fix Version/s: 2.1.0 2.0.2 > The url linking to `AccumulatorV2` in the document is incorrect. > > > Key: SPARK-17880 > URL: https://issues.apache.org/jira/browse/SPARK-17880 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > In `programming-guide.md`, the url which links to `AccumulatorV2` says > `api/scala/index.html#org.apache.spark.AccumulatorV2` but > `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.
[jira] [Commented] (SPARK-17846) A bad state of Running Applications with spark standalone HA
[ https://issues.apache.org/jira/browse/SPARK-17846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567662#comment-15567662 ] Saisai Shao commented on SPARK-17846: - I think this issue should be the same as SPARK-14262. > A bad state of Running Applications with spark standalone HA > - > > Key: SPARK-17846 > URL: https://issues.apache.org/jira/browse/SPARK-17846 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: dylanzhou >Priority: Critical > Attachments: Problem screenshots.jpg > > > I am using standalone mode. When I use HA (configured in two ways), I found the > applications' state was "WAITING". Is this a bug?
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567457#comment-15567457 ] Cody Koeninger commented on SPARK-17344: Given the choice between rewriting the underlying Kafka consumers and having a split codebase, I'd rather have a split codebase. Of course I'd rather not sink development effort into an old version of Kafka at all, until the structured stream for 0.10 is working for my use cases. But if you want to wrap the 0.8 RDD in a structured stream, go for it; I'll help you figure out how to do it. Seriously. Don't expect larger project uptake, but if you just need something to work for you > Kafka 0.8 support for Structured Streaming > -- > > Key: SPARK-17344 > URL: https://issues.apache.org/jira/browse/SPARK-17344 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > Design and implement Kafka 0.8-based sources and sinks for Structured > Streaming.
[jira] [Closed] (SPARK-17837) Disaster recovery of offsets from WAL
[ https://issues.apache.org/jira/browse/SPARK-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger closed SPARK-17837. -- Resolution: Duplicate Duplicate of SPARK-17829 > Disaster recovery of offsets from WAL > - > > Key: SPARK-17837 > URL: https://issues.apache.org/jira/browse/SPARK-17837 > Project: Spark > Issue Type: Sub-task >Reporter: Cody Koeninger > > "The SQL offsets are stored in a WAL at $checkpointLocation/offsets/$batchId. > As reynold suggests though, we should change this to use a less opaque > format."
[jira] [Resolved] (SPARK-17720) Static configurations in SQL
[ https://issues.apache.org/jira/browse/SPARK-17720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-17720. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15295 [https://github.com/apache/spark/pull/15295] > Static configurations in SQL > > > Key: SPARK-17720 > URL: https://issues.apache.org/jira/browse/SPARK-17720 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > > > Spark SQL has two kinds of configuration parameters: dynamic configs and > static configs. Dynamic configs can be modified after Spark SQL is launched > (after SparkSession is setup), whereas static configs are immutable once the > service starts. > It would be useful to have this separation and tell users if the user tries > to set a static config after the service starts.
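A hedged sketch of the distinction, assuming a Spark 2.1-style SparkSession named `spark` and that `spark.sql.warehouse.dir` is one of the static configs (the config names here are illustrative):

```scala
// Dynamic SQL configs may be changed after the SparkSession is set up:
spark.conf.set("spark.sql.shuffle.partitions", "400")   // succeeds

// Static configs are immutable once the service starts; attempting to
// modify one should produce an error telling the user it is static:
spark.conf.set("spark.sql.warehouse.dir", "/tmp/wh")    // fails
```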
[jira] [Assigned] (SPARK-17882) RBackendHandler swallowing errors
[ https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17882: Assignee: Apache Spark > RBackendHandler swallowing errors > - > > Key: SPARK-17882 > URL: https://issues.apache.org/jira/browse/SPARK-17882 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: James Shuster >Assignee: Apache Spark >Priority: Minor > > RBackendHandler is swallowing general exceptions in handleMethodCall which > makes it impossible to debug certain issues that happen when doing an > invokeJava call. > In my case this was the following error > java.lang.IllegalAccessException: Class > org.apache.spark.api.r.RBackendHandler can not access a member of class with > modifiers "public final" > The getCause message that is written back was basically blank.
[jira] [Commented] (SPARK-17882) RBackendHandler swallowing errors
[ https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567423#comment-15567423 ] Apache Spark commented on SPARK-17882: -- User 'jrshust' has created a pull request for this issue: https://github.com/apache/spark/pull/15446 > RBackendHandler swallowing errors > - > > Key: SPARK-17882 > URL: https://issues.apache.org/jira/browse/SPARK-17882 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: James Shuster >Priority: Minor
[jira] [Assigned] (SPARK-17882) RBackendHandler swallowing errors
[ https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17882: Assignee: (was: Apache Spark) > RBackendHandler swallowing errors > - > > Key: SPARK-17882 > URL: https://issues.apache.org/jira/browse/SPARK-17882 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.1 >Reporter: James Shuster >Priority: Minor
[jira] [Commented] (SPARK-17817) PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
[ https://issues.apache.org/jira/browse/SPARK-17817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567418#comment-15567418 ] Apache Spark commented on SPARK-17817: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/15445 > PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes > --- > > Key: SPARK-17817 > URL: https://issues.apache.org/jira/browse/SPARK-17817 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1 >Reporter: Mike Dusenberry >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > Calling {{repartition}} on a PySpark RDD to increase the number of partitions > results in highly skewed partition sizes, with most having 0 rows. The > {{repartition}} method should evenly spread out the rows across the > partitions, and this behavior is correctly seen on the Scala side. > Please reference the following code for a reproducible example of this issue: > {code} > # Python > num_partitions = 2 > a = sc.parallelize(range(int(1e6)), 2) # start with 2 even partitions > l = a.repartition(num_partitions).glom().map(len).collect() # get length of > each partition > min(l), max(l), sum(l)/len(l), len(l) # skewed! > // Scala > val numPartitions = 2 > val a = sc.parallelize(0 until 1e6.toInt, 2) // start with 2 even partitions > val l = a.repartition(numPartitions).glom().map(_.length).collect() // get > length of each partition > println((l.min, l.max, l.sum/l.length, l.length)) // even! > {code} > The issue here is that highly skewed partitions can result in severe memory > pressure in subsequent steps of a processing pipeline, resulting in OOM > errors.
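One plausible mechanism for this kind of skew (a pure-Python illustration, not Spark's actual shuffle code; the batch size below is hypothetical) is round-robin distribution over large serialized batches rather than over individual rows: with only a few batches, most partitions receive nothing.

```python
# Pure-Python illustration (not Spark code): round-robin over whole
# serialized batches instead of individual rows produces the skew.
from itertools import islice

def batch(iterable, size):
    """Group an iterator into lists of at most `size` elements."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def round_robin(items, num_partitions):
    """Deal items out to partitions in round-robin order."""
    parts = [[] for _ in range(num_partitions)]
    for i, item in enumerate(items):
        parts[i % num_partitions].append(item)
    return parts

data = list(range(1000))

# Dealing out individual rows: perfectly even.
even = [len(p) for p in round_robin(data, 2)]

# Dealing out one big serialized batch (hypothetical batch size 1000):
# one partition gets every row, the other gets none.
skewed = [sum(len(b) for b in p) for p in round_robin(batch(data, 1000), 2)]
print(even, skewed)  # [500, 500] [1000, 0]
```

Dealing per element gives [500, 500]; dealing whole batches gives [1000, 0], mirroring the "most partitions have 0 rows" symptom reported above.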
[jira] [Created] (SPARK-17882) RBackendHandler swallowing errors
James Shuster created SPARK-17882: - Summary: RBackendHandler swallowing errors Key: SPARK-17882 URL: https://issues.apache.org/jira/browse/SPARK-17882 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.1 Reporter: James Shuster Priority: Minor RBackendHandler is swallowing general exceptions in handleMethodCall, which makes it impossible to debug certain issues that happen when doing an invokeJava call. In my case, this was the following error: java.lang.IllegalAccessException: Class org.apache.spark.api.r.RBackendHandler can not access a member of class with modifiers "public final" The getCause message that is written back was basically blank.
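The "basically blank" getCause message can be illustrated with a plain-Python sketch (PermissionError stands in for the Java IllegalAccessException; this is not the SparkR backend code): when an exception is raised directly, it has no cause, so reporting only the cause's message yields an empty string, while reporting the exception itself preserves the useful text.

```python
# Plain-Python sketch (not the SparkR backend): PermissionError stands in
# for java.lang.IllegalAccessException.
import traceback

def cause_only(exc):
    """Mimics writing back only the cause's message, as described above."""
    cause = exc.__cause__
    return str(cause) if cause is not None else ""

def full_report(exc):
    """Keeps the exception type and message instead of just the cause."""
    return "".join(traceback.format_exception_only(type(exc), exc)).strip()

try:
    raise PermissionError('Class RBackendHandler can not access a member '
                          'of class with modifiers "public final"')
except PermissionError as err:
    blank = cause_only(err)    # "" -- the "basically blank" message
    useful = full_report(err)  # keeps the informative error text
print(repr(blank))
print(useful)
```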
[jira] [Comment Edited] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567367#comment-15567367 ] Jeremy Smith edited comment on SPARK-17344 at 10/12/16 2:56 AM: {quote} By contrast, writing a streaming source shim around the existing simple consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't have stuff like SSL, dynamic topics, or offset committing. {quote} Serious question: Would it be so bad to have a bifurcated codebase here? People who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for them, and are probably not all that concerned about the features you mentioned. In general, structured streaming already provides a lot of the capabilities that I for one am concerned about when using Kafka - offsets are tracked natively by SS, so offset committing isn't that big of a deal; in a CDH cluster specifically, you are probably using network-level security and aren't viewing the lack of SSL as a blocker; and finally you're already resigned to static topic subscriptions because that's what you're getting with the DStream API. A simple Structured Streaming source for Kafka, even using the same underlying technology, would be a HUGE step up: * You won't have "dynamic topics" to the same level, but at least you won't have to throw away all your checkpoints just to do something with a new topic in the same application. Currently, you have to do this, because the entire graph is stored in the checkpoints along with all the topics you're ever going to look at. Structured streaming at least gives you separate checkpoints per source, rather than for the entire StreamingContext. * You're already unable to manually commit offsets; you either have to rewind to the beginning, or throw away everything from the past, or (as before) rely on the incredibly fragile StreamingContext checkpoints. 
Or, commit the topic/partition/offset to the sink so you can recover the actually processed messages from there. Again, decoupling each operation from the entire state of the StreamingContext is a huge step up, because you can actually upgrade your application code (at least in certain ways) without having to worry about re-processing stuff due to discarding the checkpoints. * It will dramatically simplify the usage of Kafka from Spark in general. 9/10 use cases involve some sort of structured data, the processing of which will have dramatically better performance when being used with tungsten than with RDD-level operations. So if the simple-consumer based Kafka source would be so easy, at the expense of some features, why not introduce it? I have a tremendous amount of respect for the complexity of Kafka and the work you're doing with it, but I also get a sense that the conceptual "perfect" here is the enemy of the good. The weekend project you mentioned would result in a dramatic improvement in the experience for a large percentage of users who are currently using Spark and Kafka together. Most companies are using some kind of Hadoop distribution (i.e. HDP or CDH) and they are slow to update things like Kafka. HDP does have 0.10 (CDH doesn't), but at what rate are people actually able to update HDP? I don't have any data on it (ironically) but I'm guessing that 0.9 still represents a fairly significant portion of the Kafka install base. Just my two cents on the matter.
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567367#comment-15567367 ] Jeremy Smith commented on SPARK-17344: -- > Kafka 0.8 support for Structured Streaming > -- > > Key: SPARK-17344 > URL: https://issues.apache.org/jira/browse/SPARK-17344 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > Design and implement Kafka 0.8-based sources and sinks for Structured > Streaming.
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567362#comment-15567362 ] Liwei Lin commented on SPARK-16845: --- Thanks for the pointer, let me look into this. :-) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567353#comment-15567353 ] Liwei Lin edited comment on SPARK-16845 at 10/12/16 2:51 AM: - [~dondrake] [~Utsumi] Could you provide a simple reproducer? was (Author: lwlin): -[~dondrake] [~Utsumi] Could you provide a simple reproducer?- I've found a reproducer in SPARK-17092; I'll take a look at this, thanks! > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567358#comment-15567358 ] Don Drake commented on SPARK-16845: --- I can't at the moment, mine is not simple. But this JIRA has one: https://issues.apache.org/jira/browse/SPARK-17092 > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567353#comment-15567353 ] Liwei Lin edited comment on SPARK-16845 at 10/12/16 2:50 AM: - -[~dondrake] [~Utsumi] Could you provide a simple reproducer?- I've found a reproducer in SPARK-17092; I'll take a look at this, thanks! was (Author: lwlin): [~dondrake] [~Utsumi] Could you provide a simple reproducer? I may help look into this, thanks! > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567353#comment-15567353 ] Liwei Lin commented on SPARK-16845: --- [~dondrake] [~Utsumi] Could you provide a simple reproducer? I may help look into this, thanks! > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas
[ https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567341#comment-15567341 ] Leandro Ferrado edited comment on SPARK-11758 at 10/12/16 2:43 AM: --- Hi Holden. First, I would add just a single line in order to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a LongInt column). The idea is to first convert all columns into string types, thus the function DataFrame.to_records(index=False) wouldn't make bad conversions with datetime.datetime objects. However, that can be done only if we define a pyspark.sql.dataframe.DataFrame with a schema of strings or if we didn't define an schema (in that case, the function create an schema of strings). So, the modification is only present on the condition 'schema=None' and the snippet would be: --- if has_pandas and isinstance(data, pandas.DataFrame): if schema is None: # Having the 2 following lines on the clause schema = [str(x) for x in data.columns] data = data.astype(str) # Converting all fields on string objects because we don't have a defined schema data = [r.tolist() for r in data.to_records(index=False)] --- In case of having an schema with timestamps (e.g. TimestampType() or DateType()), it is needed a prior conversion between datetime.datetime objects on Python to a convenient format for pyspark DataFrames. Regarding to the 'index=False' term, so far I can't figure out an scenario in which it is needed an index per row on a DataFrame. So it may be fine that argument on the function, I'm not sure. was (Author: leferrad): Hi Holden. First, I would add just a single line in order to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a LongInt column). 
The idea is to first convert all columns into string types, thus the function DataFrame.to_records(index=False) wouldn't make bad conversions with datetime.datetime objects. However, that can be done only if we define a pyspark.sql.dataframe.DataFrame with a schema of strings or if we didn't define an schema (in that case, the function create an schema of strings). So, the modification is only present on the condition 'schema=None' and the snippet would be: --- if has_pandas and isinstance(data, pandas.DataFrame): if schema is None: ## begin if clause ## schema = [str(x) for x in data.columns] data = data.astype(str) # Converting all fields on string objects because we don't have a defined schema ## end if clause ## data = [r.tolist() for r in data.to_records(index=False)] --- In case of having an schema with timestamps (e.g. TimestampType() or DateType()), it is needed a prior conversion between datetime.datetime objects on Python to a convenient format for pyspark DataFrames. Regarding to the 'index=False' term, so far I can't figure out an scenario in which it is needed an index per row on a DataFrame. So it may be fine that argument on the function, I'm not sure. > Missing Index column while creating a DataFrame from Pandas > > > Key: SPARK-11758 > URL: https://issues.apache.org/jira/browse/SPARK-11758 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 > Environment: Linux Debian, PySpark, in local testing. >Reporter: Leandro Ferrado >Priority: Minor > Original Estimate: 5h > Remaining Estimate: 5h > > In PySpark's SQLContext, when it invokes createDataFrame() from a > pandas.DataFrame and indicating a 'schema' with StructFields, the function > _createFromLocal() converts the pandas.DataFrame but ignoring two points: > - Index column, because the flag index=False > - Timestamp's records, because a Date column can't be index and Pandas > doesn't converts its records in Timestamp's type. 
> So, converting a DataFrame from Pandas to SQL is poor in scenarios with > temporal records. > Doc: > http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html > Affected code: > def _createFromLocal(self, data, schema): > """ > Create an RDD for DataFrame from an list or pandas.DataFrame, returns > the RDD and schema. > """ > if has_pandas and isinstance(data, pandas.DataFrame): > if schema is None: > schema = [str(x) for x in data.columns] > data = [r.tolist() for r in data.to_records(index=False)] # HERE > # ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) --
[jira] [Commented] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas
[ https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567341#comment-15567341 ] Leandro Ferrado commented on SPARK-11758: - Hi Holden. First, I would add just a single line in order to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a LongInt column). The idea is to first convert all columns into string types, thus the function DataFrame.to_records(index=False) wouldn't make bad conversions with datetime.datetime objects. However, that can be done only if we define a pyspark.sql.dataframe.DataFrame with a schema of strings or if we didn't define an schema (in that case, the function create an schema of strings). So, the modification is only present on the condition 'schema=None' and the snippet would be: --- if has_pandas and isinstance(data, pandas.DataFrame): if schema is None: ## begin if clause ## schema = [str(x) for x in data.columns] data = data.astype(str) # Converting all fields on string objects because we don't have a defined schema ## end if clause ## data = [r.tolist() for r in data.to_records(index=False)] --- In case of having an schema with timestamps (e.g. TimestampType() or DateType()), it is needed a prior conversion between datetime.datetime objects on Python to a convenient format for pyspark DataFrames. Regarding to the 'index=False' term, so far I can't figure out an scenario in which it is needed an index per row on a DataFrame. So it may be fine that argument on the function, I'm not sure. > Missing Index column while creating a DataFrame from Pandas > > > Key: SPARK-11758 > URL: https://issues.apache.org/jira/browse/SPARK-11758 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 > Environment: Linux Debian, PySpark, in local testing. 
>Reporter: Leandro Ferrado >Priority: Minor > Original Estimate: 5h > Remaining Estimate: 5h > > In PySpark's SQLContext, when it invokes createDataFrame() from a > pandas.DataFrame and indicating a 'schema' with StructFields, the function > _createFromLocal() converts the pandas.DataFrame but ignoring two points: > - Index column, because the flag index=False > - Timestamp's records, because a Date column can't be index and Pandas > doesn't converts its records in Timestamp's type. > So, converting a DataFrame from Pandas to SQL is poor in scenarios with > temporal records. > Doc: > http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html > Affected code: > def _createFromLocal(self, data, schema): > """ > Create an RDD for DataFrame from an list or pandas.DataFrame, returns > the RDD and schema. > """ > if has_pandas and isinstance(data, pandas.DataFrame): > if schema is None: > schema = [str(x) for x in data.columns] > data = [r.tolist() for r in data.to_records(index=False)] # HERE > # ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas
[ https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567329#comment-15567329 ] Leandro Ferrado edited comment on SPARK-11758 at 10/12/16 2:39 AM: --- Hi Holden. First, I would add just a single line in order to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a LongInt column). The idea is to first convert all columns into string types, thus the function DataFrame.to_records(index=False) wouldn't make bad conversions with datetime.datetime objects. However, that can be done only if we define a pyspark.sql.dataframe.DataFrame with a schema of strings or if we didn't define an schema (in that case, the function create an schema of strings). So, the modification is only present on the condition 'schema=None' and the snippet would be: --- if has_pandas and isinstance(data, pandas.DataFrame): if schema is None: # begin if clause# schema = [str(x) for x in data.columns] data = data.astype(str) # Converting all fields on string objects because we don't have a defined schema # end if clause# data = [r.tolist() for r in data.to_records(index=False)] --- In case of having an schema with timestamps (e.g. TimestampType() or DateType()), it is needed a prior conversion between datetime.datetime objects on Python to a convenient format for pyspark DataFrames. Regarding to the 'index=False' term, so far I can't figure out an scenario in which it is needed an index per row on a DataFrame. So it may be fine that argument on the function, I'm not sure. was (Author: leferrad): Hi Holden. First, I would add just a single line in order to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a LongInt column). 
The idea is to first convert all columns into string types, thus the function DataFrame.to_records(index=False) wouldn't make bad conversions with datetime.datetime objects. However, that can be done only if we define a pyspark.sql.dataframe.DataFrame with a schema of strings or if we didn't define an schema (in that case, the function create an schema of strings). So, the modification is only present on the condition 'schema=None' and the snippet would be: --- if has_pandas and isinstance(data, pandas.DataFrame): if schema is None: schema = [str(x) for x in data.columns] data = data.astype(str) # Converting all fields on string objects because we don't have a defined schema data = [r.tolist() for r in data.to_records(index=False)] --- In case of having an schema with timestamps (e.g. TimestampType() or DateType()), it is needed a prior conversion between datetime.datetime objects on Python to a convenient format for pyspark DataFrames. Regarding to the 'index=False' term, so far I can't figure out an scenario in which it is needed an index per row on a DataFrame. So it may be fine that argument on the function, I'm not sure. > Missing Index column while creating a DataFrame from Pandas > > > Key: SPARK-11758 > URL: https://issues.apache.org/jira/browse/SPARK-11758 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 > Environment: Linux Debian, PySpark, in local testing. >Reporter: Leandro Ferrado >Priority: Minor > Original Estimate: 5h > Remaining Estimate: 5h > > In PySpark's SQLContext, when it invokes createDataFrame() from a > pandas.DataFrame and indicating a 'schema' with StructFields, the function > _createFromLocal() converts the pandas.DataFrame but ignoring two points: > - Index column, because the flag index=False > - Timestamp's records, because a Date column can't be index and Pandas > doesn't converts its records in Timestamp's type. 
> So, converting a DataFrame from Pandas to SQL is poor in scenarios with > temporal records. > Doc: > http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html > Affected code: > def _createFromLocal(self, data, schema): > """ > Create an RDD for DataFrame from an list or pandas.DataFrame, returns > the RDD and schema. > """ > if has_pandas and isinstance(data, pandas.DataFrame): > if schema is None: > schema = [str(x) for x in data.columns] > data = [r.tolist() for r in data.to_records(index=False)] # HERE > # ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@
[jira] [Issue Comment Deleted] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas
[ https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leandro Ferrado updated SPARK-11758: Comment: was deleted (was: Hi Holden. First, I would add just a single line in order to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a LongInt column). The idea is to first convert all columns into string types, thus the function DataFrame.to_records(index=False) wouldn't make bad conversions with datetime.datetime objects. However, that can be done only if we define a pyspark.sql.dataframe.DataFrame with a schema of strings or if we didn't define an schema (in that case, the function create an schema of strings). So, the modification is only present on the condition 'schema=None' and the snippet would be: --- if has_pandas and isinstance(data, pandas.DataFrame): if schema is None: # begin if clause# schema = [str(x) for x in data.columns] data = data.astype(str) # Converting all fields on string objects because we don't have a defined schema # end if clause# data = [r.tolist() for r in data.to_records(index=False)] --- In case of having an schema with timestamps (e.g. TimestampType() or DateType()), it is needed a prior conversion between datetime.datetime objects on Python to a convenient format for pyspark DataFrames. Regarding to the 'index=False' term, so far I can't figure out an scenario in which it is needed an index per row on a DataFrame. So it may be fine that argument on the function, I'm not sure.) > Missing Index column while creating a DataFrame from Pandas > > > Key: SPARK-11758 > URL: https://issues.apache.org/jira/browse/SPARK-11758 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 > Environment: Linux Debian, PySpark, in local testing. 
>Reporter: Leandro Ferrado >Priority: Minor > Original Estimate: 5h > Remaining Estimate: 5h > > In PySpark's SQLContext, when it invokes createDataFrame() from a > pandas.DataFrame and indicating a 'schema' with StructFields, the function > _createFromLocal() converts the pandas.DataFrame but ignoring two points: > - Index column, because the flag index=False > - Timestamp's records, because a Date column can't be index and Pandas > doesn't converts its records in Timestamp's type. > So, converting a DataFrame from Pandas to SQL is poor in scenarios with > temporal records. > Doc: > http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html > Affected code: > def _createFromLocal(self, data, schema): > """ > Create an RDD for DataFrame from an list or pandas.DataFrame, returns > the RDD and schema. > """ > if has_pandas and isinstance(data, pandas.DataFrame): > if schema is None: > schema = [str(x) for x in data.columns] > data = [r.tolist() for r in data.to_records(index=False)] # HERE > # ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas
[ https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567329#comment-15567329 ] Leandro Ferrado commented on SPARK-11758: - Hi Holden. First, I would add just a single line in order to avoid the bad conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column into a LongInt column). The idea is to first convert all columns into string types, thus the function DataFrame.to_records(index=False) wouldn't make bad conversions with datetime.datetime objects. However, that can be done only if we define a pyspark.sql.dataframe.DataFrame with a schema of strings or if we didn't define an schema (in that case, the function create an schema of strings). So, the modification is only present on the condition 'schema=None' and the snippet would be: --- if has_pandas and isinstance(data, pandas.DataFrame): if schema is None: schema = [str(x) for x in data.columns] data = data.astype(str) # Converting all fields on string objects because we don't have a defined schema data = [r.tolist() for r in data.to_records(index=False)] --- In case of having an schema with timestamps (e.g. TimestampType() or DateType()), it is needed a prior conversion between datetime.datetime objects on Python to a convenient format for pyspark DataFrames. Regarding to the 'index=False' term, so far I can't figure out an scenario in which it is needed an index per row on a DataFrame. So it may be fine that argument on the function, I'm not sure. > Missing Index column while creating a DataFrame from Pandas > > > Key: SPARK-11758 > URL: https://issues.apache.org/jira/browse/SPARK-11758 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 > Environment: Linux Debian, PySpark, in local testing. 
>Reporter: Leandro Ferrado >Priority: Minor > Original Estimate: 5h > Remaining Estimate: 5h > > In PySpark's SQLContext, when it invokes createDataFrame() from a > pandas.DataFrame and indicating a 'schema' with StructFields, the function > _createFromLocal() converts the pandas.DataFrame but ignoring two points: > - Index column, because the flag index=False > - Timestamp's records, because a Date column can't be index and Pandas > doesn't converts its records in Timestamp's type. > So, converting a DataFrame from Pandas to SQL is poor in scenarios with > temporal records. > Doc: > http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html > Affected code: > def _createFromLocal(self, data, schema): > """ > Create an RDD for DataFrame from an list or pandas.DataFrame, returns > the RDD and schema. > """ > if has_pandas and isinstance(data, pandas.DataFrame): > if schema is None: > schema = [str(x) for x in data.columns] > data = [r.tolist() for r in data.to_records(index=False)] # HERE > # ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
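The workaround Leandro proposes can be sketched in isolation. This is a hedged illustration of the idea, not Spark's actual `_createFromLocal` patch: when `schema` is `None`, cast the whole pandas DataFrame to `str` before calling `to_records(index=False)`, so datetime values survive as readable strings instead of being mangled into raw integers.

```python
# Sketch of the proposed schema=None branch (illustrative only, not the
# real PySpark code path): cast everything to str before to_records() so
# datetime columns are not lossily converted.
import datetime
import pandas as pd

df = pd.DataFrame({
    "ts": [datetime.datetime(2016, 10, 12, 2, 43)],
    "v": [1],
})

# Mimic _createFromLocal with no schema given, plus the proposed cast:
schema = [str(c) for c in df.columns]
data = [r.tolist() for r in df.astype(str).to_records(index=False)]

print(schema)      # ['ts', 'v']
print(data[0][0])  # the timestamp as a string rather than an integer
```

The trade-off, as noted in the comment, is that an explicit schema with `TimestampType()` or `DateType()` still needs a separate conversion step back from strings.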
[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567321#comment-15567321 ] Michael Armbrust commented on SPARK-17344: -- These are good questions. A few thoughts: bq. How long would it take CDH to distribute 0.10 if there was a compelling Spark client for it? Even if they were going to release kafka 0.10 in CDH yesterday, my experience is that many will take a long time for people to upgrade. We spent a fair amount of effort on multi-version compatibility for Hive in Spark SQL and it was great boost for adoption. I think this could be the same thing. bq. How are you going to handle SSL? You can't avoid the complexity of caching consumers if you still want the benefits of prefetching, and doing an SSL handshake for every batch will kill performance if they aren't cached. An option here would be to use the internal client directly. This way we can leverage all the work that they did to support SSL, etc yet make it speak specific versions of the protocol as we need. I did a [really rough prototype|https://gist.github.com/marmbrus/7d116b0a9672337497ddfccc0657dbf0] using the APIs described above and it is not that much code. There is clearly a lot more we'd need to do, but I think we should strongly consider this option. Caching connections to the specific brokers should probably still be implemented for the reasons you describe (and this is already handled by the internal client). An advantage here is you'd actually be able to share connections across queries without running into correctness problems. > Kafka 0.8 support for Structured Streaming > -- > > Key: SPARK-17344 > URL: https://issues.apache.org/jira/browse/SPARK-17344 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > Design and implement Kafka 0.8-based sources and sinks for Structured > Streaming. 
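The consumer-caching concern raised above (an SSL handshake per micro-batch kills performance unless clients are reused) can be sketched generically. Everything below — `KafkaClient`, `ConsumerCache` — is a hypothetical stand-in, not Spark's or the Kafka client library's real API:

```python
# Hypothetical sketch: cache one client per (topic, partition) and reuse it
# across batches, so the expensive connection setup happens once, not per batch.

class KafkaClient:
    handshakes = 0  # counts expensive connection setups across all clients

    def __init__(self, topic, partition):
        KafkaClient.handshakes += 1  # the cost we want to amortize
        self.topic, self.partition = topic, partition

class ConsumerCache:
    def __init__(self):
        self._cache = {}

    def get(self, topic, partition):
        # One client per (topic, partition), created lazily, then reused.
        key = (topic, partition)
        if key not in self._cache:
            self._cache[key] = KafkaClient(topic, partition)
        return self._cache[key]

cache = ConsumerCache()
for _batch in range(5):         # five micro-batches...
    c = cache.get("events", 0)  # ...but only one handshake
print(KafkaClient.handshakes)   # 1
```

Sharing such a cache across queries is where the correctness concerns discussed in the thread come in, which is why delegating connection management to the internal client is attractive.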
[jira] [Comment Edited] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567271#comment-15567271 ] Ryan Brant edited comment on SPARK-12484 at 10/12/16 2:04 AM: -- Was there a resolution to this? I am also getting this issue in Scala. I am currently using Spark 2.0 was (Author: brantrm): Was there a resolution to this? I am also getting this issue in Scala. > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) 
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
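The AnalysisException above ("resolved attribute(s) _c0#2 missing from id#0,labelStr#1") is what Spark's analyzer raises when a projected expression references an attribute that is not in the child plan's output — typically because the Column passed to withColumn() was built against a different DataFrame than the one being transformed. The sketch below is a hypothetical Python analogue of that resolution check, not Spark code; the name `check_project` and the attribute strings are illustrative only.

```python
# Hypothetical sketch of the analyzer check behind
# "resolved attribute(s) ... missing from ...". Not Spark code.

class AnalysisError(Exception):
    pass

def check_project(child_output, projected_attrs):
    """Fail if a projected attribute is absent from the child plan's
    output, loosely mirroring the failAnalysis call in the stack trace."""
    missing = [a for a in projected_attrs if a not in child_output]
    if missing:
        raise AnalysisError(
            "resolved attribute(s) %s missing from %s"
            % (",".join(missing), ",".join(child_output)))

# A column derived from the same DataFrame resolves fine:
check_project(["id#0", "labelStr#1"], ["id#0", "labelStr#1"])

# A column built against a *different* DataFrame carries an attribute
# (_c0#2) the target plan does not produce, so analysis fails:
try:
    check_project(["id#0", "labelStr#1"], ["id#0", "labelStr#1", "_c0#2"])
except AnalysisError as err:
    print(err)
```

The practical workaround in this situation is usually to express the new column as an expression over the target DataFrame itself (e.g. via a registered UDF applied to that DataFrame's own columns) rather than reusing a Column object from another DataFrame.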
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567271#comment-15567271 ] Ryan Brant commented on SPARK-12484: Was there a resolution to this? I am also getting this issue in Scala. > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent 
by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17870: Assignee: Apache Spark > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Assignee: Apache Spark >Priority: Critical > > The method to count ChiSqureTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > For feature selection method ChiSquareSelector, it is based on the > ChiSquareTestResult.statistic (ChiSqure value) to select the features. It > select the features with the largest ChiSqure value. But the Degree of > Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and > for different df, you cannot base on ChiSqure value to select features. > Because of the wrong method to count ChiSquare value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If use selectKBest to select: the feature 3 will be selected. > If use selectFpr to select: feature 1 and 2 will be selected. > This is strange. > I use scikit learn to test the same data with the same parameters. > When use selectKBest to select: feature 1 will be selected. > When use selectFpr to select: feature 1 and 2 will be selected. > This result is make sense. because the df of each feature in scikit learn is > the same. > I plan to submit a PR for this problem. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
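The core of the report is that raw chi-square statistics are only comparable when their degrees of freedom agree: the same statistic maps to very different tail probabilities under different df. This can be shown without Spark, using the standard closed forms of the chi-square survival function for df=1 and df=2 (all names below are illustrative, not Spark or MLlib APIs):

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function (p-value) via closed forms.
    Covers only df=1 and df=2, which is enough for the illustration."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("sketch only covers df=1 and df=2")

# Feature A: statistic 4.0 with df=1; feature B: statistic 5.0 with df=2.
p_a = chi2_sf(4.0, 1)
p_b = chi2_sf(5.0, 2)

# Ranking by raw statistic prefers B (5.0 > 4.0), but once df is taken
# into account B is the *less* significant feature (p_b > p_a). This is
# why selecting on raw statistics with mixed df gives different results
# than scikit-learn, which keeps the df comparable.
assert p_a < p_b
```

This also motivates the suggestion in the comments: when df differs per feature, rank by p-value (as SelectFpr already does) rather than by the raw statistic.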
[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567244#comment-15567244 ] Apache Spark commented on SPARK-17870: -- User 'mpjlu' has created a pull request for this issue: https://github.com/apache/spark/pull/15444 > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > > The method to count ChiSqureTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > For feature selection method ChiSquareSelector, it is based on the > ChiSquareTestResult.statistic (ChiSqure value) to select the features. It > select the features with the largest ChiSqure value. But the Degree of > Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and > for different df, you cannot base on ChiSqure value to select features. > Because of the wrong method to count ChiSquare value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If use selectKBest to select: the feature 3 will be selected. > If use selectFpr to select: feature 1 and 2 will be selected. > This is strange. > I use scikit learn to test the same data with the same parameters. > When use selectKBest to select: feature 1 will be selected. > When use selectFpr to select: feature 1 and 2 will be selected. > This result is make sense. because the df of each feature in scikit learn is > the same. > I plan to submit a PR for this problem. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17870: Assignee: (was: Apache Spark) > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > > The method to count ChiSqureTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > For feature selection method ChiSquareSelector, it is based on the > ChiSquareTestResult.statistic (ChiSqure value) to select the features. It > select the features with the largest ChiSqure value. But the Degree of > Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and > for different df, you cannot base on ChiSqure value to select features. > Because of the wrong method to count ChiSquare value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If use selectKBest to select: the feature 3 will be selected. > If use selectFpr to select: feature 1 and 2 will be selected. > This is strange. > I use scikit learn to test the same data with the same parameters. > When use selectKBest to select: feature 1 will be selected. > When use selectFpr to select: feature 1 and 2 will be selected. > This result is make sense. because the df of each feature in scikit learn is > the same. > I plan to submit a PR for this problem. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17881) Aggregation function for generating string histograms
[ https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567226#comment-15567226 ] Apache Spark commented on SPARK-17881: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/15443 > Aggregation function for generating string histograms > - > > Key: SPARK-17881 > URL: https://issues.apache.org/jira/browse/SPARK-17881 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > This agg function generates equi-width histograms for string type columns, > with a maximum number of histogram bins. It returns a empty result if the > ndv(number of distinct value) of the column exceeds the maximum number > allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17881) Aggregation function for generating string histograms
[ https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17881: Assignee: Apache Spark > Aggregation function for generating string histograms > - > > Key: SPARK-17881 > URL: https://issues.apache.org/jira/browse/SPARK-17881 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > > This agg function generates equi-width histograms for string type columns, > with a maximum number of histogram bins. It returns a empty result if the > ndv(number of distinct value) of the column exceeds the maximum number > allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17881) Aggregation function for generating string histograms
[ https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17881: Assignee: (was: Apache Spark) > Aggregation function for generating string histograms > - > > Key: SPARK-17881 > URL: https://issues.apache.org/jira/browse/SPARK-17881 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Zhenhua Wang > > This agg function generates equi-width histograms for string type columns, > with a maximum number of histogram bins. It returns a empty result if the > ndv(number of distinct value) of the column exceeds the maximum number > allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17881) Aggregation function for generating string histograms
Zhenhua Wang created SPARK-17881: Summary: Aggregation function for generating string histograms Key: SPARK-17881 URL: https://issues.apache.org/jira/browse/SPARK-17881 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Zhenhua Wang This agg function generates equi-width histograms for string type columns, with a maximum number of histogram bins. It returns an empty result if the ndv (number of distinct values) of the column exceeds the maximum number allowed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
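The intended semantics can be sketched without Spark: count occurrences per distinct string value (one bin per value), and return an empty result once the number of distinct values exceeds the bin limit. The function name and return shape below are illustrative, not the actual SQL aggregate.

```python
from collections import Counter

def string_histogram(values, max_bins):
    """Histogram for strings: one bin per distinct value; returns an
    empty result if the number of distinct values (ndv) exceeds max_bins."""
    counts = Counter(values)
    if len(counts) > max_bins:
        return {}  # ndv exceeds the allowed number of bins
    return dict(counts)

data = ["a", "b", "a", "c", "a"]
assert string_histogram(data, max_bins=4) == {"a": 3, "b": 1, "c": 1}
assert string_histogram(data, max_bins=2) == {}  # ndv=3 > 2
```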
[jira] [Commented] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.
[ https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567199#comment-15567199 ] Apache Spark commented on SPARK-17853: -- User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/15442 > Kafka OffsetOutOfRangeException on DStreams union from separate Kafka > clusters with identical topic names. > -- > > Key: SPARK-17853 > URL: https://issues.apache.org/jira/browse/SPARK-17853 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Marcin Kuthan > > During migration from Spark 1.6 to 2.0 I observed OffsetOutOfRangeException > reported by Kafka client. In our scenario we create single DStream as a union > of multiple DStreams. One DStream for one Kafka cluster (multi dc solution). > Both Kafka clusters have the same topics and number of partitions. > After quick investigation, I found that class DirectKafkaInputDStream keeps > offset state for topic and partitions, but it is not aware of different Kafka > clusters. > For every topic, single DStream is created as a union from all configured > Kafka clusters. > {code} > class KafkaDStreamSource(configs: Iterable[Map[String, String]]) { > def createSource(ssc: StreamingContext, topic: String): DStream[(String, > Array[Byte])] = { > val streams = configs.map { config => > val kafkaParams = config > val kafkaTopics = Set(topic) > KafkaUtils. > createDirectStream[String, Array[Byte]]( > ssc, > LocationStrategies.PreferConsistent, > ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, > kafkaParams) > ).map { record => > (record.key, record.value) > } > } > ssc.union(streams.toSeq) > } > } > {code} > At the end, offsets from one Kafka cluster overwrite offsets from second one. > Fortunately OffsetOutOfRangeException was thrown because offsets in both > Kafka clusters are significantly different. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
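The overwrite described above can be reproduced with plain dictionaries: if offset state is keyed only by (topic, partition), the second cluster's offsets clobber the first cluster's, whereas including a cluster identifier in the key keeps the state separate. This is a minimal sketch of the bookkeeping problem, not Spark code; the cluster names and offsets are made up.

```python
# Keying offsets by (topic, partition) only, as the report says the
# union of DStreams effectively does, lets one Kafka cluster's offsets
# overwrite the other's.
offsets = {}
for cluster, offset in [("dc1", 1000), ("dc2", 5000000)]:
    offsets[("events", 0)] = offset          # cluster is lost from the key
assert offsets[("events", 0)] == 5000000     # dc1's offset 1000 is gone

# Keying by (cluster, topic, partition) keeps per-cluster state intact.
offsets = {}
for cluster, offset in [("dc1", 1000), ("dc2", 5000000)]:
    offsets[(cluster, "events", 0)] = offset
assert offsets[("dc1", "events", 0)] == 1000
assert offsets[("dc2", "events", 0)] == 5000000
```

The OffsetOutOfRangeException surfaces only because the two clusters' offsets happen to differ wildly; with closer offsets the collision would silently read the wrong data.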
[jira] [Assigned] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.
[ https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17853: Assignee: (was: Apache Spark) > Kafka OffsetOutOfRangeException on DStreams union from separate Kafka > clusters with identical topic names. > -- > > Key: SPARK-17853 > URL: https://issues.apache.org/jira/browse/SPARK-17853 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Marcin Kuthan > > During migration from Spark 1.6 to 2.0 I observed OffsetOutOfRangeException > reported by Kafka client. In our scenario we create single DStream as a union > of multiple DStreams. One DStream for one Kafka cluster (multi dc solution). > Both Kafka clusters have the same topics and number of partitions. > After quick investigation, I found that class DirectKafkaInputDStream keeps > offset state for topic and partitions, but it is not aware of different Kafka > clusters. > For every topic, single DStream is created as a union from all configured > Kafka clusters. > {code} > class KafkaDStreamSource(configs: Iterable[Map[String, String]]) { > def createSource(ssc: StreamingContext, topic: String): DStream[(String, > Array[Byte])] = { > val streams = configs.map { config => > val kafkaParams = config > val kafkaTopics = Set(topic) > KafkaUtils. > createDirectStream[String, Array[Byte]]( > ssc, > LocationStrategies.PreferConsistent, > ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, > kafkaParams) > ).map { record => > (record.key, record.value) > } > } > ssc.union(streams.toSeq) > } > } > {code} > At the end, offsets from one Kafka cluster overwrite offsets from second one. > Fortunately OffsetOutOfRangeException was thrown because offsets in both > Kafka clusters are significantly different. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.
[ https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17853: Assignee: Apache Spark > Kafka OffsetOutOfRangeException on DStreams union from separate Kafka > clusters with identical topic names. > -- > > Key: SPARK-17853 > URL: https://issues.apache.org/jira/browse/SPARK-17853 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Marcin Kuthan >Assignee: Apache Spark > > During migration from Spark 1.6 to 2.0 I observed OffsetOutOfRangeException > reported by Kafka client. In our scenario we create single DStream as a union > of multiple DStreams. One DStream for one Kafka cluster (multi dc solution). > Both Kafka clusters have the same topics and number of partitions. > After quick investigation, I found that class DirectKafkaInputDStream keeps > offset state for topic and partitions, but it is not aware of different Kafka > clusters. > For every topic, single DStream is created as a union from all configured > Kafka clusters. > {code} > class KafkaDStreamSource(configs: Iterable[Map[String, String]]) { > def createSource(ssc: StreamingContext, topic: String): DStream[(String, > Array[Byte])] = { > val streams = configs.map { config => > val kafkaParams = config > val kafkaTopics = Set(topic) > KafkaUtils. > createDirectStream[String, Array[Byte]]( > ssc, > LocationStrategies.PreferConsistent, > ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, > kafkaParams) > ).map { record => > (record.key, record.value) > } > } > ssc.union(streams.toSeq) > } > } > {code} > At the end, offsets from one Kafka cluster overwrite offsets from second one. > Fortunately OffsetOutOfRangeException was thrown because offsets in both > Kafka clusters are significantly different. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567180#comment-15567180 ] Hossein Falaki commented on SPARK-17878: Sure. If passing a list is possible it is the better choice. I just don't want to block this feature on API change in SparkSQL. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple spark users have asked for this > feature built into the CSV data source. It can be easily implemented in a > backwards compatible way: > - Currently CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understands option names that match > {{nullValue[\d]}}. This way user can specify a query with multiple or one > null value. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
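The proposed {{nullValue[\d]}} convention is easy to prototype: collect every option whose name is nullValue followed by an optional digit, then treat any cell equal to one of the collected sentinels as null. The sketch below is an illustrative Python analogue of the option-parsing side only (the real CSVOptions is Scala, and these function names are assumptions):

```python
import re

def collect_null_values(options):
    """Gather null sentinels from options named nullValue, nullValue1,
    nullValue2, ... while ignoring unrelated options."""
    pattern = re.compile(r"^nullValue\d*$")
    return {v for k, v in options.items() if pattern.match(k)}

def parse_cell(cell, null_values):
    """Map a raw CSV cell to None if it matches any null sentinel."""
    return None if cell in null_values else cell

opts = {"nullValue1": "-", "nullValue2": "*", "header": "true"}
sentinels = collect_null_values(opts)
assert sentinels == {"-", "*"}
assert parse_cell("-", sentinels) is None
assert parse_cell("42", sentinels) == "42"
```

Because the bare name `nullValue` also matches the pattern, existing callers that set a single null sentinel would keep working unchanged, which is the backwards-compatibility point made above.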
[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567170#comment-15567170 ] Peng Meng commented on SPARK-17870: --- hi [~avulanov], the question here is not whether to use raw chi2 scores or p-values; the question is that if raw chi2 scores are used, the DoF must be the same. "chi2-test is used multiple times" is a separate problem. According to (http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html), "whenever a statistical test is used multiple times, then the probability of getting at least one error increases." This problem is partially addressed by selecting the p-values corresponding to the family-wise error rate (SelectFwe, SPARK-17645). Thanks very much. Hi [~srowen], I totally agree with your comments. Since the DoF differs across Spark's ChiSquare values, we can use the p-values for Spark's SelectKBest and SelectPercentile. Thanks very much. I will submit a PR for this. > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > > The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > The feature selection method ChiSquareSelector uses the > ChiSquareTestResult.statistic (ChiSquare value) to select features: it > selects the features with the largest ChiSquare values. But the Degree of > Freedom (df) of the ChiSquare value differs in Statistics.chiSqTest(RDD), and > with different df you cannot select features based on the ChiSquare value alone. > Because of the incorrect computation of the ChiSquare value, the feature selection > results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If selectKBest is used: feature 3 will be selected. > If selectFpr is used: features 1 and 2 will be selected. > This is strange. > I used scikit-learn to test the same data with the same parameters. > When selectKBest is used: feature 1 will be selected. > When selectFpr is used: features 1 and 2 will be selected. > This result makes sense, because the df of each feature in scikit-learn is > the same. > I plan to submit a PR for this problem. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567149#comment-15567149 ] Hyukjin Kwon commented on SPARK-17878: -- BTW, maybe, I will try to investigate further if it is really possible (setting a list as the value in options) if you think both ideas are okay. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple spark users have asked for this > feature built into the CSV data source. It can be easily implemented in a > backwards compatible way: > - Currently CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understands option names that match > {{nullValue[\d]}}. This way user can specify a query with multiple or one > null value. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567124#comment-15567124 ] Hyukjin Kwon edited comment on SPARK-17878 at 10/12/16 12:50 AM: - Oh, I didn't mean I am against this. I am just wondering if it is just possible to deal with this in general. If it is not easy for now, I'd rather support this idea if we should deal with this problem. (Actually, one of the votes is from me :)) was (Author: hyukjin.kwon): Oh, I didn't mean I am against this. I am just wondering if it is just possible to deal with this in general. If it is not easy for now, I support this idea. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple spark users have asked for this > feature built into the CSV data source. It can be easily implemented in a > backwards compatible way: > - Currently CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understands option names that match > {{nullValue[\d]}}. This way user can specify a query with multiple or one > null value. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567124#comment-15567124 ] Hyukjin Kwon commented on SPARK-17878: -- Oh, I didn't mean I am against this. I am just wondering if it is just possible to deal with this in general. If it is not easy for now, I support this idea. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple spark users have asked for this > feature built into the CSV data source. It can be easily implemented in a > backwards compatible way: > - Currently CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understands option names that match > {{nullValue[\d]}}. This way user can specify a query with multiple or one > null value. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567104#comment-15567104 ] Hossein Falaki commented on SPARK-17878: That would require an API change in SparkSQL. Otherwise, we would need to split the string passed to {{nullValue}} on a delimiter. Neither of these is a safe choice. I think accepting {{nullValue1}}, {{nullValue2}}, etc. (along with the existing {{nullValue}}) is: * backwards compatible * clear * extensible for other options in the future, e.g. quoteCharacter. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple Spark users have asked for this > feature built into the CSV data source. It can be easily implemented in a > backwards compatible way: > - Currently the CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understand option names that match > {{nullValue[\d]}}. This way the user can specify a query with one or multiple > null values. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data
[ https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567046#comment-15567046 ] Hyukjin Kwon commented on SPARK-17878: -- Maybe it'd be nicer if options allow list or nested map (if possible). I noticed {{read.csv}} in R can use {{na.strings}} as a list which is actually currently being mapped to {{nullValue}} as string.. > Support for multiple null values when reading CSV data > -- > > Key: SPARK-17878 > URL: https://issues.apache.org/jira/browse/SPARK-17878 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > There are CSV files out there with multiple values that are supposed to be > interpreted as null. As a result, multiple spark users have asked for this > feature built into the CSV data source. It can be easily implemented in a > backwards compatible way: > - Currently CSV data source supports an option named {{nullValue}}. > - We can add logic in {{CSVOptions}} to understands option names that match > {{nullValue[\d]}}. This way user can specify a query with multiple or one > null value. > {code} > val df = spark.read.format("CSV").option("nullValue1", > "-").option("nullValue2", "*") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4411) Add "kill" link for jobs in the UI
[ https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567028#comment-15567028 ] Apache Spark commented on SPARK-4411: - User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/15441 > Add "kill" link for jobs in the UI > -- > > Key: SPARK-4411 > URL: https://issues.apache.org/jira/browse/SPARK-4411 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Kay Ousterhout > > SPARK-4145 changes the default landing page for the UI to show jobs. We > should have a "kill" link for each job, similar to what we have for each > stage, so it's easier for users to kill slow jobs (and the semantics of > killing a job are slightly different than killing a stage). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.
[ https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17880: Assignee: (was: Apache Spark) > The url linking to `AccumulatorV2` in the document is incorrect. > > > Key: SPARK-17880 > URL: https://issues.apache.org/jira/browse/SPARK-17880 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Kousuke Saruta >Priority: Minor > > In `programming-guide.md`, the url which links to `AccumulatorV2` says > `api/scala/index.html#org.apache.spark.AccumulatorV2` but > `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.
[ https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566936#comment-15566936 ] Apache Spark commented on SPARK-17880: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/15439 > The url linking to `AccumulatorV2` in the document is incorrect. > > > Key: SPARK-17880 > URL: https://issues.apache.org/jira/browse/SPARK-17880 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Kousuke Saruta >Priority: Minor > > In `programming-guide.md`, the url which links to `AccumulatorV2` says > `api/scala/index.html#org.apache.spark.AccumulatorV2` but > `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.
[ https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17880: Assignee: Apache Spark > The url linking to `AccumulatorV2` in the document is incorrect. > > > Key: SPARK-17880 > URL: https://issues.apache.org/jira/browse/SPARK-17880 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > In `programming-guide.md`, the url which links to `AccumulatorV2` says > `api/scala/index.html#org.apache.spark.AccumulatorV2` but > `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.
Kousuke Saruta created SPARK-17880: -- Summary: The url linking to `AccumulatorV2` in the document is incorrect. Key: SPARK-17880 URL: https://issues.apache.org/jira/browse/SPARK-17880 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.0.1 Reporter: Kousuke Saruta Priority: Minor In `programming-guide.md`, the url which links to `AccumulatorV2` says `api/scala/index.html#org.apache.spark.AccumulatorV2` but `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566858#comment-15566858 ] Davies Liu commented on SPARK-15621: [~rezasafi] We usually do not backport this kind of improvement; it's too large and risky for a maintenance release (2.0.x). Sorry for that. > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Davies Liu > Fix For: 2.1.0 > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with > the identity `lambda x: x` udf function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566752#comment-15566752 ] Reza Safi commented on SPARK-15621: --- Hi [~davies], can the fix be backported to branch-2.0, since it affects version 2.0.0 as well? Thank you very much in advance. > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Davies Liu > Fix For: 2.1.0 > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with > the identity `lambda x: x` udf function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566741#comment-15566741 ] K commented on SPARK-16845: --- We manually wrote parts that were throwing errors (StringIndexer and FeatureAssembler) in RDD and converted to DataFrame to run RandomForestClassifier. > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: Java API, ML, MLlib >Affects Versions: 2.0.0 >Reporter: hejie > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf
[ https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17387. Resolution: Fixed Assignee: Jeff Zhang Fix Version/s: 2.1.0 > Creating SparkContext() from python without spark-submit ignores user conf > -- > > Key: SPARK-17387 > URL: https://issues.apache.org/jira/browse/SPARK-17387 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Jeff Zhang >Priority: Minor > Fix For: 2.1.0 > > > Consider the following scenario: user runs a python application not through > spark-submit, but by adding the pyspark module and manually creating a Spark > context. Kinda like this: > {noformat} > $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python > Python 2.7.12 (default, Jul 1 2016, 15:12:24) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > >>> from pyspark import SparkContext > >>> from pyspark import SparkConf > >>> conf = SparkConf().set("spark.driver.memory", "4g") > >>> sc = SparkContext(conf=conf) > {noformat} > If you look at the JVM launched by the pyspark code, it ignores the user's > configuration: > {noformat} > $ ps ax | grep $(pgrep -f SparkSubmit) > 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g > -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell > {noformat} > Note the "1g" of memory. If instead you use "pyspark", you get the correct > "4g" in the JVM. > This also affects other configs; for example, you can't really add jars to > the driver's classpath using "spark.jars". > You can work around this by setting the undocumented env variable Spark > itself uses: > {noformat} > $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python > Python 2.7.12 (default, Jul 1 2016, 15:12:24) > [GCC 5.4.0 20160609] on linux2 > Type "help", "copyright", "credits" or "license" for more information. 
> >>> import os > >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf > >>> spark.driver.memory=4g" > >>> from pyspark import SparkContext > >>> sc = SparkContext() > {noformat} > But it would be nicer if the configs were automatically propagated. > BTW the reason for this is that the {{launch_gateway}} function used to start > the JVM does not take any parameters, and the only place where it reads > arguments for Spark is that env variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov updated SPARK-17877: Description: The following code demonstrates the issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly")) val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears val gg = g.connectedComponents gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false, /tmp/check still contains only 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f {code} I think the last line should return true instead of false was: The following code demonstrates the issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly")) val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears val gg = g.connectedComponents gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false, /tmp/check/ contains only 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f {code} I think the last line should return true instead of false > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List(3L -> "lucas", 7L 
-> "john", 5L -> "matt", 2L > -> "kelly")) > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears > val gg = g.connectedComponents > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed // res5: Boolean = false, /tmp/check still contains only > 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9879) OOM in LIMIT clause with large number
[ https://issues.apache.org/jira/browse/SPARK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566639#comment-15566639 ] Dongjoon Hyun commented on SPARK-9879: -- Hi, All. The PR seems to be closed last December. Can we close this if this is not reproduced? > OOM in LIMIT clause with large number > - > > Key: SPARK-9879 > URL: https://issues.apache.org/jira/browse/SPARK-9879 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > {code} > create table spark.tablsetest as select * from dpa_ord_bill_tf order by > member_id limit 2000; > {code} > > {code} > spark-sql --driver-memory 48g --executor-memory 24g --driver-java-options > -XX:PermSize=1024M -XX:MaxPermSize=2048M > Error logs > 15/07/27 10:22:43 ERROR ActorSystemImpl: Uncaught fatal error from thread > [sparkDriver-akka.actor.default-dispatcher-20]shutting down ActorSystem > [sparkDriver] > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.util.Arrays.copyOf(Arrays.java:2271) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) > at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) > at > java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852) > at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708) > at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134) > at com.esotericsoftware.kryo.io.Output.flush(Output.java:155) > at com.esotericsoftware.kryo.io.Output.close(Output.java:165) > at > org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162) > at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:139) > at > org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$writeObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:65) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239) > at > 
org.apache.spark.rdd.ParallelCollectionPartition.writeObject(ParallelCollectionRDD.scala:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81) > at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:168) > at > org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:467) > at > org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:231) > 15/07/27 10:22:43 ERROR ErrorMonitor: Uncaught fatal error from thread > [sparkDriver-akka.actor.default-dispatcher-20]shutting down ActorSystem > [sparkDriver] > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.util.Arrays.copyOf(Arrays.java:2271) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) > at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) > at > java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852) > at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708) > at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134) > at com.esotericsoftware.kryo.io.Output.flush(Output.java:155) > at com.esotericsoftware.kryo.io.Output.close(Output.java:165) > at > org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162) > at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:139) > at > org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$writeObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:65) > at org.apache.spark.util.Utils$.tryOrIOExcep
[jira] [Created] (SPARK-17879) Don't compact metadata logs constantly into a single compacted file
Burak Yavuz created SPARK-17879: --- Summary: Don't compact metadata logs constantly into a single compacted file Key: SPARK-17879 URL: https://issues.apache.org/jira/browse/SPARK-17879 Project: Spark Issue Type: Bug Components: SQL, Streaming Affects Versions: 2.0.1 Reporter: Burak Yavuz With metadata log compaction, we compact all files into a single file every "n" batches. The problem is, over time, this single file becomes huge, and could become an issue to constantly write out in the driver. It would be a good idea to cap the compacted file size, so that we don't end up writing huge files in the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
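The size cap proposed here can be sketched as a simple planning policy. Everything below is a hypothetical illustration (the function and parameter names are invented, not Spark's actual metadata-log API): instead of folding all batches into one ever-growing compacted file, a new compacted file is started whenever the current one would exceed a byte budget.

```python
def plan_compactions(batch_sizes, max_bytes):
    """Group batch indices into compacted files, capping each file's
    total size at max_bytes (a single oversized batch still gets its
    own file)."""
    groups, current, current_bytes = [], [], 0
    for i, size in enumerate(batch_sizes):
        if current and current_bytes + size > max_bytes:
            # starting a new compacted file keeps any single file bounded
            groups.append(current)
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# three 40-byte batches with a 100-byte cap -> two compacted files
print(plan_compactions([40, 40, 40], max_bytes=100))  # [[0, 1], [2]]
```

The policy keeps each write in the driver bounded by the cap while still reducing the number of small metadata files.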
[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566586#comment-15566586 ] Sean Owen commented on SPARK-17870: --- If the degrees of freedom are the same across the tests, then ranking on p-value or statistic should give the same ranking, because the p-value is a monotonically decreasing function of the statistic. That's the case in what the scikit code is effectively doing, because there are always (# label classes - 1) degrees of freedom. Really the p-value is the comparable quantity, but there's no point computing it in this case because it's just for ranking. The Spark code performs a chi-squared test but applies it to answer a different question, where the DOF is no longer the same; it's (# label classes - 1) * (# feature classes - 1) in the contingency table here. The p-value is no longer always smaller when the statistic is larger. So it's necessary to actually use the p-values for what Spark is doing. > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > > The method used to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > The feature selection method ChiSquareSelector uses the > ChiSqTestResult.statistic (chi-square value) to select features: it > selects the features with the largest chi-square values. But the degrees of > freedom (df) of the chi-square values differ across features in Statistics.chiSqTest(RDD), and > with different df you cannot rely on the chi-square value to select features. > Because of this wrong way of computing the chi-square value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 will be selected. > If selectFpr is used: features 1 and 2 will be selected. > This is strange. > I used scikit-learn to test the same data with the same parameters. > When selectKBest is used: feature 1 will be selected. > When selectFpr is used: features 1 and 2 will be selected. > This result makes sense, because the df of each feature in scikit-learn is > the same. > I plan to submit a PR for this problem. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
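The point about unequal degrees of freedom can be checked numerically. The sketch below uses made-up statistics and dof values and a stdlib-only chi-squared tail probability (the closed forms below are exact for dof = 1 and for even dof): the feature with the larger statistic ends up with the larger p-value, so the two rankings disagree.

```python
import math

def chi2_sf(x, dof):
    """P(X > x) for X ~ chi-squared(dof); closed forms for dof == 1
    and even dof are enough for this illustration."""
    if dof == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if dof % 2 == 0:
        half = x / 2.0
        return math.exp(-half) * sum(half ** k / math.factorial(k)
                                     for k in range(dof // 2))
    raise NotImplementedError("odd dof > 1 not needed here")

# Hypothetical feature tests: (name, chi-squared statistic, degrees of freedom)
tests = [("f1", 10.0, 1), ("f2", 12.0, 8)]

best_by_statistic = max(tests, key=lambda t: t[1])[0]
best_by_pvalue = min(tests, key=lambda t: chi2_sf(t[1], t[2]))[0]
print(best_by_statistic, best_by_pvalue)  # f2 f1 -- the rankings disagree
```

Here f2 wins on the raw statistic (12.0 > 10.0), but f1 has the much smaller p-value (about 0.0016 vs. 0.15), which is why the selector must compare p-values when the dof varies.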
[jira] [Created] (SPARK-17878) Support for multiple null values when reading CSV data
Hossein Falaki created SPARK-17878: -- Summary: Support for multiple null values when reading CSV data Key: SPARK-17878 URL: https://issues.apache.org/jira/browse/SPARK-17878 Project: Spark Issue Type: Story Components: SQL Affects Versions: 2.0.1 Reporter: Hossein Falaki There are CSV files out there with multiple values that are supposed to be interpreted as null. As a result, multiple Spark users have asked for this feature to be built into the CSV data source. It can be easily implemented in a backwards-compatible way: - Currently the CSV data source supports an option named {{nullValue}}. - We can add logic in {{CSVOptions}} to understand option names that match {{nullValue[\d]}}. This way the user can specify one or more null values in a query. {code} val df = spark.read.format("CSV").option("nullValue1", "-").option("nullValue2", "*") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
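Outside Spark, the intended semantics of multiple null sentinels can be sketched with Python's stdlib csv module (the helper name and its API are invented for illustration only; this is not the proposed Spark implementation):

```python
import csv
import io

def read_csv_with_nulls(text, null_values):
    """Parse CSV text, mapping any cell that matches one of the
    configured sentinel strings to None."""
    return [[None if cell in null_values else cell for cell in row]
            for row in csv.reader(io.StringIO(text))]

data = "a,-,1\n*,b,2\n"
print(read_csv_with_nulls(data, {"-", "*"}))
# [['a', None, '1'], [None, 'b', '2']]
```

Both "-" and "*" come back as None, which is exactly what multiple {{nullValue}}-style options would let a Spark user express.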
[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs
[ https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nic Eggert updated SPARK-17455: --- Priority: Major (was: Minor) > IsotonicRegression takes non-polynomial time for some inputs > > > Key: SPARK-17455 > URL: https://issues.apache.org/jira/browse/SPARK-17455 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0 >Reporter: Nic Eggert > > The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently > in MLlib can take O(N!) time for certain inputs, when it should have > worst-case complexity of O(N^2). > To reproduce this, I pulled the private method poolAdjacentViolators out of > mllib.regression.IsotonicRegression and into a benchmarking harness. > Given this input > {code} > val x = (1 to length).toArray.map(_.toDouble) > val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 > else yi} > val w = Array.fill(length)(1d) > val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, > x), w) => (y, x, w)} > {code} > I vary the length of the input to get these timings: > || Input Length || Time (us) || > | 100 | 1.35 | > | 200 | 3.14 | > | 400 | 116.10 | > | 800 | 2134225.90 | > (tests were performed using > https://github.com/sirthias/scala-benchmarking-template) > I can also confirm that I run into this issue on a real dataset I'm working > on when trying to calibrate random forest probability output. Some partitions > take > 12 hours to run. This isn't a skew issue, since the largest partitions > finish in minutes. I can only assume that some partitions cause something > approaching this worst-case complexity. > I'm working on a patch that borrows the implementation that is used in > scikit-learn and the R "iso" package, both of which handle this particular > input in linear time and are quadratic in the worst case. 
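For reference, the linear-time pooling the reporter alludes to (the approach used by scikit-learn and the R "iso" package) can be sketched as a stack of blocks: each input point is pushed once and every merge removes a block, so total work is linear in the input plus merges. This is a generic PAVA sketch, not MLlib's actual implementation:

```python
def pava(y, w):
    """Weighted isotonic (non-decreasing) regression via Pool Adjacent
    Violators, using a stack of (mean, weight, count) blocks."""
    blocks = []  # each entry: [weighted mean, total weight, point count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    out = []
    for m, _, n in blocks:
        out.extend([m] * n)
    return out

print(pava([1.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0]))
# [1.0, 2.5, 2.5, 4.0]
```

On the alternating input from the issue description, this formulation never re-scans already-merged blocks, which is what avoids the pathological runtimes reported above.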
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566541#comment-15566541 ] Hossein Falaki commented on SPARK-17781: [~shivaram] Thanks for looking into it. I think the problem applies to {{dapply}} as well. For example this fails: {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > collect(dapply(df, function(x) {data.frame(res = x$date)}, schema = > structType(structField("res", "date" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 52.0 failed 4 times, most recent failure: Lost task 0.3 in stage 52.0 (TID 10114, 10.0.229.211): java.lang.RuntimeException: java.lang.Double is not a valid external type for schema of date {code} I spent a few hours getting to the root of it. We have the correct type all the way until {{readList}} in {{deserialize.R}}. I instrumented that function. We get the correct type from {{readObject()}} but once it is placed in the list it loses its type. > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17863) SELECT distinct does not work if there is a order by clause
[ https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17863: - Description: {code} select distinct struct.a, struct.b from ( select named_struct('a', 1, 'b', 2, 'c', 3) as struct union all select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp order by struct.a, struct.b {code} This query generates {code} +---+---+ | a| b| +---+---+ | 1| 2| | 1| 2| +---+---+ {code} The plan is wrong because the analyze somehow added {{struct#21805}} to the project list, which changes the semantic of the distinct (basically, the query is changed to {{select distinct struct.a, struct.b, struct}} from {{select distinct struct.a, struct.b}}). {code} == Parsed Logical Plan == 'Sort ['struct.a ASC, 'struct.b ASC], true +- 'Distinct +- 'Project ['struct.a, 'struct.b] +- 'SubqueryAlias tmp +- 'Union :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805] : +- OneRowRelation$ +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806] +- OneRowRelation$ == Analyzed Logical Plan == a: int, b: int Project [a#21819, b#21820] +- Sort [struct#21805.a ASC, struct#21805.b ASC], true +- Distinct +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, struct#21805] +- SubqueryAlias tmp +- Union :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805] : +- OneRowRelation$ +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806] +- OneRowRelation$ == Optimized Logical Plan == Project [a#21819, b#21820] +- Sort [struct#21805.a ASC, struct#21805.b ASC], true +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, struct#21805] +- Union :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] : +- OneRowRelation$ +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] +- OneRowRelation$ == Physical Plan == *Project [a#21819, b#21820] +- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0 +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 
200) +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805]) +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200) +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805]) +- Union :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] : +- Scan OneRowRelation[] +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] +- Scan OneRowRelation[] {code} was: {code} select distinct struct.a, struct.b from ( select named_struct('a', 1, 'b', 2, 'c', 3) as struct union all select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp order by struct.a, struct.b {code} This query generates {code} +---+---+ | a| b| +---+---+ | 1| 2| | 1| 2| +---+---+ {code} The plan is wrong {code} == Parsed Logical Plan == 'Sort ['struct.a ASC, 'struct.b ASC], true +- 'Distinct +- 'Project ['struct.a, 'struct.b] +- 'SubqueryAlias tmp +- 'Union :- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805] : +- OneRowRelation$ +- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806] +- OneRowRelation$ == Analyzed Logical Plan == a: int, b: int Project [a#21819, b#21820] +- Sort [struct#21805.a ASC, struct#21805.b ASC], true +- Distinct +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, struct#21805] +- SubqueryAlias tmp +- Union :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805] : +- OneRowRelation$ +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806] +- OneRowRelation$ == Optimized Logical Plan == Project [a#21819, b#21820] +- Sort [struct#21805.a ASC, struct#21805.b ASC], true +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, struct#21805] +- Union :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805] : +- OneRowRelation$ +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806] +- OneRowRelation$ == Physical Plan == *Project [a#21819, b#21820] +- *Sort [struct#21805.a ASC, 
struct#21805.b ASC], true, 0 +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200) +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], output=[a#21819, b#21820, struct#21805]) +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200) +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions
[jira] [Comment Edited] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566515#comment-15566515 ] Harish edited comment on SPARK-17463 at 10/11/16 8:34 PM: -- My second approach was: def testfunc(keys, vals, columnsToStandardize): df= pd.DataFrame(vals, columns = keys) df[columnsToStandardize] = df[columnsToStandardize] - df[columnsToStandardize].mean() df3.rdd.map(keys).groupByKey().flatMap(lambda keyval: testfunc(keys[0], keys[1], columnsToStandardize)) was (Author: harishk15): My second approach was: def testfunc(keys, vals, columnsToStandardize): df= pd.DataFrame(vals, columns = keys) df[columnsToStandardize] = df[columnsToStandardize] - df[columnsToStandardize].mean() df3.rdd.map(keys).groupByKey().flatMap(lambda keyval: testfunc(keys[0], keys[1], columnsToStandardize)) > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at 
scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at 
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.write
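The group-wise standardization sketched in the comment above (subtract each group's mean from its members) can be illustrated in plain Python with no pandas or PySpark dependency. This is a hedged sketch: the function and field names below are hypothetical, not the commenter's actual code.

```python
from collections import defaultdict

def standardize_by_group(rows, key_field, value_field):
    """Mean-center a numeric field within each group.

    rows: list of dicts; key_field picks the grouping key, value_field the
    numeric column. Mirrors the pandas-based testfunc idea above, standalone.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_field]].append(row)
    out = []
    for members in groups.values():
        mean = sum(r[value_field] for r in members) / len(members)
        for r in members:
            centered = dict(r)
            # Subtract the group's mean so each group is centered at 0.
            centered[value_field] = r[value_field] - mean
            out.append(centered)
    return out

rows = [
    {"k": "a", "v": 1.0},
    {"k": "a", "v": 3.0},
    {"k": "b", "v": 10.0},
]
print(standardize_by_group(rows, "k", "v"))
```

In PySpark the same shape would come from a groupByKey followed by a flatMap, as in the comment, with this logic applied per group.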
[jira] [Assigned] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17845: Assignee: Apache Spark (was: Reynold Xin) > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > {code} > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > In Spark's DataFrame API, we use integer values to indicate relative position: > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue means "UNBOUNDED FOLLOWING" > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsBetween(0, Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetween(Long.MinValue, Long.MaxValue) > {code} > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > 1. The API is not self-evident. There is no way for a new user to figure out > how to indicate an unbounded frame by looking at just the API. The user has > to read the doc to figure this out. > 2. It is weird that Long.MinValue or Long.MaxValue has some special meaning. > 3.
Different languages have different min/max values, e.g. in Python we use > -sys.maxsize and +sys.maxsize. > To make this API less confusing, we have a few options: > Option 1. Add the following (additional) methods: > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) // this one exists already > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > {code} > This is obviously very verbose, but is very similar to how these frames > are specified in SQL, and is perhaps the most obvious to end users, especially if > they come from a SQL background. > Option 2. Decouple the specification of the frame begin and frame end into two > functions. Assume the boundary is unbounded unless specified. > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsToCurrent() or Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsFromCurrent() or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > // no need to specify > {code} > If we go with option 2, we should throw exceptions if users specify multiple > from's or to's.
A variant of option 2 is to require explicit specification > of begin/end even in the case of an unbounded boundary, e.g.: > {code} > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
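The integer-sentinel scheme the issue criticizes can be modeled in a few lines: map an (start, end) pair, with min/max sentinels for the unbounded ends, back to the ANSI ROWS BETWEEN clause. This is a toy illustration in plain Python, not Spark's Window API; `frame_sql` and the sentinel names are made up for this sketch, echoing the `-sys.maxsize`/`+sys.maxsize` convention the issue mentions for the Python API.

```python
import sys

# Sentinels analogous to Long.MinValue / Long.MaxValue in the Scala API.
UNBOUNDED_PRECEDING = -sys.maxsize
UNBOUNDED_FOLLOWING = sys.maxsize

def frame_sql(start, end):
    """Render an integer (start, end) frame spec as an ANSI SQL clause."""
    def bound(v):
        if v == UNBOUNDED_PRECEDING:
            return "UNBOUNDED PRECEDING"
        if v == UNBOUNDED_FOLLOWING:
            return "UNBOUNDED FOLLOWING"
        if v == 0:
            return "CURRENT ROW"
        return f"{-v} PRECEDING" if v < 0 else f"{v} FOLLOWING"
    return f"ROWS BETWEEN {bound(start)} AND {bound(end)}"

print(frame_sql(-3, 3))
print(frame_sql(UNBOUNDED_PRECEDING, 0))
```

The sketch makes the complaint concrete: nothing about the magic sentinel values is self-evident at the call site, which is exactly what the named-method options try to fix.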
[jira] [Assigned] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17845: Assignee: Reynold Xin (was: Apache Spark) > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566509#comment-15566509 ] Apache Spark commented on SPARK-17845: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15438 > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17857) SHOW TABLES IN schema throws exception if schema doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-17857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-17857. - Resolution: Not A Problem Although the behavior is changed from 1.x, we had better close this issue as 'NOT A PROBLEM' in Spark 2.x. I'm closing this now. You can reopen this if you need to do. > SHOW TABLES IN schema throws exception if schema doesn't exist > -- > > Key: SPARK-17857 > URL: https://issues.apache.org/jira/browse/SPARK-17857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Todd Nemet >Priority: Minor > > SHOW TABLES IN badschema; throws > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException if badschema > doesn't exist. In Spark 1.x it would return an empty result set. > On Spark 2.0.1: > {code} > [683|12:45:56] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/ -n hive > Connecting to jdbc:hive2://localhost:10006/ > 16/10/10 12:46:00 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/10 12:46:00 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/10 12:46:00 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/ > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 1.2.1.spark2 by Apache Hive > 0: jdbc:hive2://localhost:10006/> show schemas; > +---+--+ > | databaseName | > +---+--+ > | default | > | looker_scratch| > | spark_jira| > | spark_looker_scratch | > | spark_looker_test | > +---+--+ > 5 rows selected (0.61 seconds) > 0: jdbc:hive2://localhost:10006/> show tables in spark_looker_test; > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.611 seconds) > 0: jdbc:hive2://localhost:10006/> show tables in badschema; > Error: 
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: > Database 'badschema' not found; (state=,code=0) > {code} > On Spark 1.6.2: > {code} > [680|12:47:26] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10005/ -n hive > Connecting to jdbc:hive2://localhost:10005/ > 16/10/10 12:47:29 INFO jdbc.Utils: Supplied authorities: localhost:10005 > 16/10/10 12:47:29 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/10 12:47:30 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/ > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 1.2.1.spark2 by Apache Hive > 0: jdbc:hive2://localhost:10005/> show schemas; > ++--+ > | result | > ++--+ > | default| > | spark_jira | > | spark_looker_test | > | spark_scratch | > ++--+ > 4 rows selected (0.613 seconds) > 0: jdbc:hive2://localhost:10005/> show tables in spark_looker_test; > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.575 seconds) > 0: jdbc:hive2://localhost:10005/> show tables in badschema; > ++--+--+ > | tableName | isTemporary | > ++--+--+ > ++--+--+ > No rows selected (0.458 seconds) > {code} > [Relevant part of Hive QL > docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566467#comment-15566467 ] Alexander Ulanov commented on SPARK-17870: -- [`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) works with "a Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores". According to what you observe, it uses p-values to sort the `chi2` outputs. Indeed, this is the case for all functions that return two arrays: https://github.com/scikit-learn/scikit-learn/blob/412996f/sklearn/feature_selection/univariate_selection.py#L331. Alternatively, one can use the raw `chi2` scores for sorting; one only needs to pass the first array from `chi2` to `SelectKBest`. As far as I remember, using raw chi2 scores is the default in Weka's [ChiSquaredAttributeEval](http://weka.sourceforge.net/doc.stable/weka/attributeSelection/ChiSquaredAttributeEval.html). So, I would not claim that either approach is incorrect. According to [Introduction to IR](http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html), there might be an issue with computing p-values because the chi2 test is then used multiple times. Using plain chi2 values does not involve a statistical test, so it might be treated as just a ranking with no statistical implications. > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > > The method to compute the ChiSquareTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong.
> For the feature selection method ChiSquareSelector, selection is based on the > ChiSquareTestResult.statistic (chi-square value): it > selects the features with the largest chi-square values. But the Degrees of > Freedom (df) of the chi-square values differ in Statistics.chiSqTest(RDD), and > with different df you cannot compare raw chi-square values to select features. > Because of this wrong way of computing the chi-square value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If selectKBest is used: feature 3 will be selected. > If selectFpr is used: features 1 and 2 will be selected. > This is strange. > I used scikit-learn to test the same data with the same parameters. > With selectKBest: feature 1 is selected. > With selectFpr: features 1 and 2 are selected. > This result makes sense, because the df of each feature in scikit-learn is > the same. > I plan to submit a PR for this problem. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
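The disagreement discussed above (ranking by raw chi-square statistic vs. ranking by p-value) can be shown with a toy selector. The scores and p-values below are made-up illustrative numbers, not output from Spark or scikit-learn; the point is only that the two orderings can pick different features when degrees of freedom differ.

```python
def select_k_best(features, k, by="pvalue"):
    """Pick the k best feature names.

    features: dict mapping name -> (chi2_score, pvalue).
    by="pvalue" ranks ascending by p-value (scikit-learn's behavior for
    score functions returning two arrays); by="score" ranks descending by
    the raw chi2 statistic (what Spark's ChiSqSelector effectively did).
    """
    if by == "pvalue":
        ranked = sorted(features, key=lambda f: features[f][1])
    else:
        ranked = sorted(features, key=lambda f: -features[f][0])
    return ranked[:k]

# f3 has the largest raw statistic but, having more degrees of freedom,
# a larger (worse) p-value than f1 -- so the two rankings disagree.
feats = {"f1": (6.6, 0.01), "f2": (5.0, 0.03), "f3": (9.0, 0.06)}
print(select_k_best(feats, 1, by="pvalue"))
print(select_k_best(feats, 1, by="score"))
```

This mirrors the report: score-based selection picks feature 3 while p-value-based selection picks feature 1, and neither ranking is wrong per se, as the comment argues.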
[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()
[ https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566457#comment-15566457 ] Shivaram Venkataraman commented on SPARK-17781: --- [~falaki] I looked at this a bit more today and it looks like this problem is specific to dapplyCollect -- what happens here is that we don't have the schema for the output table, so we serialize it as a byteArray [1] and rely on the driver to do the conversion / deserialization while running collect. I couldn't trace this part to the end, but it looks like this gets deserialized in [2] and the call to unserialize there interprets the bytes as double instead of date. I'm not sure what a good fix for this would be, either. [1] https://github.com/apache/spark/blob/master/R/pkg/inst/worker/worker.R#L75 [2] https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L1431 > datetime is serialized as double inside dapply() > > > Key: SPARK-17781 > URL: https://issues.apache.org/jira/browse/SPARK-17781 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > When we ship a SparkDataFrame to workers for dapply family functions, inside > the worker DateTime objects are serialized as double. > To reproduce: > {code} > df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date())) > dapplyCollect(df, function(x) { return(x$date) }) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11784) enable Timestamp filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566439#comment-15566439 ] Ian commented on SPARK-11784: - Yes, I meant TimestampType filter pushdown > enable Timestamp filter pushdown > > > Key: SPARK-11784 > URL: https://issues.apache.org/jira/browse/SPARK-11784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Ian > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4411) Add "kill" link for jobs in the UI
[ https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566430#comment-15566430 ] Alex Bozarth commented on SPARK-4411: - I'm currently working on this. I'm updating the original pr to work with the latest code and to match how the kill stage code works > Add "kill" link for jobs in the UI > -- > > Key: SPARK-4411 > URL: https://issues.apache.org/jira/browse/SPARK-4411 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 1.2.0 >Reporter: Kay Ousterhout > > SPARK-4145 changes the default landing page for the UI to show jobs. We > should have a "kill" link for each job, similar to what we have for each > stage, so it's easier for users to kill slow jobs (and the semantics of > killing a job are slightly different than killing a stage). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566427#comment-15566427 ] Harish edited comment on SPARK-17463 at 10/11/16 8:06 PM: -- No i dont have any code like that. I use pyspark .. Please find my code snippet df1 with 60 columns (70M records) df2 with 3000-7000 (varies) columns (10M) join df1 and df2 with key columns (please note df1 is more granular data and df2 one level above. So data set will grow df3 = df1.join(df2, [keys]) aggList = [func.mean(col).alias(col + '_m') for col in df2.columns] Last part is i do -- df4 = df3.groupBy(keys).agg(*aggList) --I applying mean to each column of the df3 data frame which might be 3000-1 columns. Let me know if you need entire stack trace of this issue. PS: We still have issue https://issues.apache.org/jira/browse/SPARK-16845 -- So i have to break number of columns 500 chunks was (Author: harishk15): No i dont have any code like that. I use pyspark .. Please find my code snippet df1 with 60 columns (70M records) df2 with 3000-7000 (varies) columns (10M) join df1 and df2 with key columns (please note df1 is more granular data and df2 one level above. So data set will grow df3 = df1.join(df2, [keys]) aggList = [func.mean(col).alias(col + '_m') for col in df2.columns] Last part is i do -- df4 = df3.groupBy(keys).agg(*aggList) --I applying mean to each column of the df3 data frame which might be 3000-1 columns. 
PS: We still have issue https://issues.apache.org/jira/browse/SPARK-16845 -- So i have to break number of columns 500 chunks > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObjec
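The workaround Harish describes (breaking a very wide aggregation into chunks of 500 columns because of SPARK-16845) can be sketched as below. The chunking helpers are plain Python; the Spark calls in the usage comment mirror the snippet in the comment and are illustrative only — `keys`, `df2`, and `df3` are the names from the thread, not definitions.

```python
# Sketch of the chunked-aggregation workaround described above: aggregate a
# very wide DataFrame in chunks of at most 500 columns, then join the partial
# results back on the key columns.

def chunk(columns, size=500):
    """Split a list of column names into consecutive chunks of at most `size`."""
    return [columns[i:i + size] for i in range(0, len(columns), size)]

def mean_exprs(func, columns):
    # Build one mean(...).alias(col + '_m') expression per column,
    # exactly as aggList is built in the comment above.
    return [func.mean(c).alias(c + '_m') for c in columns]

# Usage against a real SparkSession (illustrative, not executed here):
# from pyspark.sql import functions as func
# from functools import reduce
# partials = [df3.groupBy(keys).agg(*mean_exprs(func, cols))
#             for cols in chunk(df2.columns)]
# df4 = reduce(lambda a, b: a.join(b, keys), partials)
```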
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566427#comment-15566427 ] Harish commented on SPARK-17463: No, I don't have any code like that; I use PySpark. Please find my code snippet. df1 has 60 columns (70M records); df2 has 3000-7000 (varies) columns (10M records). I join df1 and df2 on the key columns (please note df1 is more granular data and df2 is one level above, so the data set will grow):
{code}
df3 = df1.join(df2, [keys])
aggList = [func.mean(col).alias(col + '_m') for col in df2.columns]
df4 = df3.groupBy(keys).agg(*aggList)
{code}
The last part applies a mean to each column of the df3 data frame, which might be 3000-1 columns. PS: We still have issue https://issues.apache.org/jira/browse/SPARK-16845 -- so I have to break the columns into chunks of 500. > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) 
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > jav
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566349#comment-15566349 ] Jerome Scheuring edited comment on SPARK-12216 at 10/11/16 7:59 PM: _Note that I am entirely new to the process of submitting issues on this system: if this needs to be a new issue, I would appreciate someone letting me know._ A bug very similar to this one is 100% reproducible across multiple machines, running both Windows 8.1 and Windows 10, compiled with Scala 2.11 and running under Spark 2.0.1. It occurs:
* in Scala, but not in Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running locally, not when submitted to a cluster

_Update:_ The bug also does not occur when run on the Spark 2.0.1 installation inside "Bash on Ubuntu on Windows" (i.e. the Linux subsystem) on the same Windows 10 machine where the bug _does_ occur when the program is executed from Windows. 
This program will produce the bug (if {{poemData}} is defined per the commented-out section, rather than being read from a CSV file, the bug does not occur):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {
    val poemSchema = StructType(
      Seq(
        StructField("label", IntegerType),
        StructField("line", StringType)
      )
    )

    val sparkSession = SparkSession.builder()
      .appName("Spark Bug Demonstration")
      .master("local[*]")
      .getOrCreate()

    //val poemData = sparkSession.createDataFrame(Seq(
    //  (0, "There's many a strong farmer"),
    //  (0, "Who's heart would break in two"),
    //  (1, "If he could see the townland"),
    //  (1, "That we are riding to;")
    //)).toDF("label", "line")

    val poemData = sparkSession.read
      .option("quote", value = "")
      .schema(poemSchema)
      .csv(args(0))

    println(s"Record count: ${poemData.count()}")
  }
}
{code}
Assuming that {{args(0)}} contains the path to a file with comma-separated integer/string pairs, as in:
{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}
was (Author: jerome.scheuring): _Note that I am entirely new to the process of submitting issues on this system: if this needs to be a new issue, I would appreciate someone letting me know._ A bug very similar to this one is 100% reproducible across multiple machines, running both Windows 8.1 and Windows 10, compiled with Scala 2.11 and running under Spark 2.0.1. 
It occurs * in Scala, but not Python (have not tried R) * only when reading CSV files (and not, for example, when reading Parquet files) * only when running local, not submitted to a cluster This program will produce the bug (if {{poemData}} is defined per the commented-out section, rather than being read from a CSV file, the bug does not occur): {code} import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types._ object SparkBugDemo { def main(args: Array[String]): Unit = { val poemSchema = StructType( Seq( StructField("label",IntegerType), StructField("line",StringType) ) ) val sparkSession = SparkSession.builder() .appName("Spark Bug Demonstration") .master("local[*]") .getOrCreate() //val poemData = sparkSession.createDataFrame(Seq( // (0, "There's many a strong farmer"), // (0, "Who's heart would break in two"), // (1, "If he could see the townland"), // (1, "That we are riding to;") //)).toDF("label", "line") val poemData = sparkSession.read .option("quote", value="") .schema(poemSchema) .csv(args(0)) println(s"Record count: ${poemData.count()}") } } {code} Assuming that {{args(0)}} contains the path to a file with comma-separated integer/string pairs, as in: {noformat} 0,There's many a strong farmer 0,Who's heart would break in two 1,If he could see the townland 1,That we are riding to; {noformat} > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Rep
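For reference, the quote handling that the repro above relies on ({{.option("quote", "")}}, i.e. no quote character, so apostrophes in the text are taken literally) can be sketched outside Spark with Python's csv module. This only illustrates the data shape of the sample file, not the Spark code path that triggers the temp-directory bug.

```python
import csv
import io

# The sample label,line pairs from the comment above.
sample = """0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;"""

def read_poem(text):
    # quoting=csv.QUOTE_NONE mirrors .option("quote", ""): quote characters
    # (e.g. the apostrophe in "There's") are treated as ordinary text.
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONE)
    return [(int(label), line) for label, line in reader]

rows = read_poem(sample)
```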
[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov updated SPARK-17877: Description: The following code demonstrates the issue
{code}
import org.apache.spark.graphx._

val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears

val gg = g.connectedComponents
gg.checkpoint
gg.vertices.collect
gg.edges.collect
gg.isCheckpointed // res5: Boolean = false, /tmp/check/ contains only 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
{code}
I think the last line should return true instead of false

was: The following code demonstrates the issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly")) val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L > -> "kelly")) > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 
5L, "colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears > val gg = g.connectedComponents > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed // res5: Boolean = false, /tmp/check/ contains only 1 > folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566413#comment-15566413 ] Harish commented on SPARK-17463: Ok. thanks for the update. Do we have any work around for the second part of the issue? I tried this set("spark.rpc.netty.dispatcher.numThreads","2") but no luck > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) >
[jira] [Updated] (SPARK-17816) Json serialization of accumulators is failing with ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17816: - Fix Version/s: 2.0.2 > Json serialzation of accumulators are failing with > ConcurrentModificationException > -- > > Key: SPARK-17816 > URL: https://issues.apache.org/jira/browse/SPARK-17816 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Ergin Seyfe >Assignee: Ergin Seyfe > Fix For: 2.0.2, 2.1.0 > > > This is the stack trace: See {{ConcurrentModificationException}}: > {code} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.to(TraversableLike.scala:590) > at scala.collection.AbstractTraversable.to(Traversable.scala:104) > at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294) > at scala.collection.AbstractTraversable.toList(Traversable.scala:104) > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at scala.Option.map(Option.scala:146) > at > 
org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283) > at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
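The failure mode in the SPARK-17816 trace above — `writeObject`/`toList` walking an accumulator's backing list while another thread mutates it — has a compact analog in plain Python, where mutating a dict while iterating over it raises RuntimeError. This is only an illustration of the race's shape, not Spark code; Python surfaces it even single-threaded, which the JVM's ConcurrentModificationException check does as well.

```python
# Illustration of the concurrent-modification failure mode: iterating a
# collection (as serialization does) while it is being mutated.

def iterate_while_mutating(d):
    """Iterate over d while inserting new keys; return the error raised."""
    try:
        for k in d:           # analogous to writeObject walking the ArrayList
            d[k + 100] = 0    # analogous to another thread appending a value
        return None
    except RuntimeError as e:  # "dictionary changed size during iteration"
        return e

err = iterate_while_mutating({1: 'a', 2: 'b'})
```

The fix in both worlds is the same idea: serialize a snapshot (copy) of the collection, or synchronize readers and writers on the same lock.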
[jira] [Updated] (SPARK-17816) Json serialization of accumulators is failing with ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17816: - Affects Version/s: 2.0.1 > Json serialzation of accumulators are failing with > ConcurrentModificationException > -- > > Key: SPARK-17816 > URL: https://issues.apache.org/jira/browse/SPARK-17816 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Ergin Seyfe >Assignee: Ergin Seyfe > Fix For: 2.0.2, 2.1.0 > > > This is the stack trace: See {{ConcurrentModificationException}}: > {code} > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183) > at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.to(TraversableLike.scala:590) > at scala.collection.AbstractTraversable.to(Traversable.scala:104) > at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294) > at scala.collection.AbstractTraversable.toList(Traversable.scala:104) > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291) > at scala.Option.map(Option.scala:146) > at > 
org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283) > at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at > 
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244) > at > org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566385#comment-15566385 ] Alexander Pivovarov edited comment on SPARK-17877 at 10/11/16 7:50 PM: --- Other open issues related to checkpointing are SPARK-14804 and SPARK-12431 was (Author: apivovarov): Another open issue with checkpointing is SPARK-14804 > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L > -> "kelly")) > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint > val gg = g.connectedComponents > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed // res5: Boolean = false > {code} > I think the last line should return true instead of false
[jira] [Updated] (SPARK-17812) More granular control of starting offsets (assign)
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17812: - Summary: More granular control of starting offsets (assign) (was: More granular control of starting offsets) > More granular control of starting offsets (assign) > -- > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17812) More granular control of starting offsets
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566392#comment-15566392 ] Michael Armbrust commented on SPARK-17812: -- For the seeking back {{X}} offsets use case, I was interactively querying a stream and I wanted *some* data, but not *all available data*. I did not have specific offsets in mind, and under the assumption that items are getting hashed across partitions, X offsets back is a very reasonable proxy for time. I agree actual time would be better. However, since there is disagreement on this case, I'd propose we break that out into its own ticket and focus on assign here. I'm not sure I understand the concern with the {{startingOffsets}} option naming (which we can still change, though it would be nice to do so before a release happens). It affects which offsets will be included in the query, and it only takes effect when the query is first started. [~ofirm], currently we support (1) (though I wouldn't say *all* data, as we are limited by retention / compaction) and (2). As you said, we can also support (3), though this must be done after the fact by adding a predicate to the stream on the timestamp column. For performance it would be nice to push that down into Kafka, but I'd split that optimization into another ticket. Regarding (4), I like the proposed JSON solution. It would be nice if this were unified with whatever format we decide to use in [SPARK-17829] so that you could easily pick up where another query left off. I'd also suggest we use {{-1}} and {{-2}} as special offsets for subscribing to a topicpartition at the earliest or latest available offsets at query start time. 
> More granular control of starting offsets > - > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latests offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek back {{X}} offsets in the stream from the moment the query starts > - seek to user specified offsets -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
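One way the per-topicpartition JSON format discussed in this thread might look is sketched below. The exact schema was still under discussion at this point, so everything here is an assumption for illustration: the {{topic -> {partition: offset}}} nesting, and {{-2}}/{{-1}} as the "earliest"/"latest" sentinels Michael suggests above.

```python
import json

# Hypothetical startingOffsets value for manually assigned topicpartitions.
# -2 and -1 are the suggested sentinels for "earliest" and "latest" at query
# start; non-negative values are concrete offsets to seek to.
EARLIEST, LATEST = -2, -1

def starting_offsets(assignments):
    """assignments: {topic: {partition: offset}} -> JSON string option value.

    Partition keys are stringified because JSON object keys must be strings.
    """
    return json.dumps(
        {topic: {str(p): off for p, off in parts.items()}
         for topic, parts in assignments.items()},
        sort_keys=True)

opts = starting_offsets({"topicA": {0: 23, 1: EARLIEST}, "topicB": {0: LATEST}})
```

A query could then pass this string as the {{startingOffsets}} option; a matching parser on the source side would turn it back into topicpartition seeks.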
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566391#comment-15566391 ] Shixiong Zhu commented on SPARK-17463: -- Do you have a reproducer? I saw `at java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)` in the stack trace, so I think the internal ArrayList is accessed somewhere. Did you use `collectionAccumulator` in your code? FYI, https://github.com/apache/spark/pull/15371 is for SPARK-17816, which fixes an issue in the driver. > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at 
java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialDat
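The root cause in the stack trace above is that {{java.io.ObjectOutputStream}} walks the live {{ArrayList}} backing the accumulator while an executor thread may still be appending to it. The sketch below is not Spark's actual patch; it deterministically reproduces the failure mode by mutating the list mid-traversal (standing in for the concurrent writer), and shows the snapshot-under-lock pattern that makes serialization safe.

```scala
import java.util.{ArrayList => JArrayList, ConcurrentModificationException}

val live = new JArrayList[Int]()
live.add(1); live.add(2)

// Simulate the serializer's traversal interleaved with a writer:
// ArrayList's fail-fast iterator throws ConcurrentModificationException.
def traverseWithConcurrentAdd(l: JArrayList[Int]): Boolean =
  try {
    val it = l.iterator()
    it.next()
    l.add(3)   // another thread's add(), squeezed in mid-traversal
    it.next()  // throws: the iterator is now invalid
    false
  } catch { case _: ConcurrentModificationException => true }

val sawCme = traverseWithConcurrentAdd(live)

// Safe pattern: copy while holding the list's monitor, then hand the
// private snapshot to the serializer. No writer can invalidate the copy.
val snapshot = live.synchronized { new JArrayList[Int](live) }
```

Serializing `snapshot` instead of `live` cannot throw this exception, because the copy is never visible to the writing thread.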
[jira] [Updated] (SPARK-17812) More granular control of starting offsets
[ https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17812: - Description: Right now you can only run a Streaming Query starting from either the earliest or latest offsets available at the moment the query is started. Sometimes this is a lot of data. It would be nice to be able to do the following: - seek to user specified offsets for manually specified topicpartitions was: Right now you can only run a Streaming Query starting from either the earliest or latest offsets available at the moment the query is started. Sometimes this is a lot of data. It would be nice to be able to do the following: - seek back {{X}} offsets in the stream from the moment the query starts - seek to user specified offsets > More granular control of starting offsets > - > > Key: SPARK-17812 > URL: https://issues.apache.org/jira/browse/SPARK-17812 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > Right now you can only run a Streaming Query starting from either the > earliest or latest offsets available at the moment the query is started. > Sometimes this is a lot of data. It would be nice to be able to do the > following: > - seek to user specified offsets for manually specified topicpartitions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566385#comment-15566385 ] Alexander Pivovarov commented on SPARK-17877: - Another open issue with checkpointing is SPARK-14804 > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L > -> "kelly")) > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint > val gg = g.connectedComponents > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed // res5: Boolean = false > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566380#comment-15566380 ] Timothy Hunter commented on SPARK-17845: I like the {{Window.rowsBetween(Long.MinValue, -3)}} syntax, but it is exposing a system implementation detail. How about having some static/singleton values that define our notion of plus/minus infinity instead of relying on the system values? Here is a suggestion: {code} Window.rowsBetween(Window.unboundedBefore, -3) object Window { def unboundedBefore: Long = Int.MinValue.toLong } {code} To get around the various sizes of ints in each language, I suggest we say that every value above 2^31 is considered unbounded above. That should be more than enough and cover at least Python, Scala, R, and Java. > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > {code} > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > In Spark's DataFrame API, we use integer values to indicate relative position: > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue means "UNBOUNDED FOLLOWING" > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > 
Window.rowsBetween(0, Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetween(Long.MinValue, Long.MaxValue) > {code} > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > 1. The API is not self-evident. There is no way for a new user to figure out > how to indicate an unbounded frame by looking at just the API. The user has > to read the doc to figure this out. > 2. It is weird that Long.MinValue or Long.MaxValue has some special meaning. > 3. Different languages have different min/max values, e.g. in Python we use > -sys.maxsize and +sys.maxsize. > To make this API less confusing, we have a few options: > Option 1. Add the following (additional) methods: > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) // this one exists already > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > {code} > This is obviously very verbose, but is very similar to how these functions > are done in SQL, and is perhaps the most obvious to end users, especially if > they come from a SQL background. > Option 2. Decouple the specification for frame begin and frame end into two > functions. Assume the boundary is unlimited unless specified. 
> {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsToCurrent() or Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsFromCurrent() or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > // no need to specify > {code} > If we go with option 2, we should throw exceptions if users specify multiple > from's or to's. A variant of option 2 is to require explicit specification > of begin/end even in the case of unbounded boundary, e.g.: > {code} > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
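Option 2 from the discussion above can be prototyped without touching Spark at all. The names ({{rowsFrom}}, {{rowsTo}}) are taken from the proposal, but the builder below is hypothetical and not Spark's actual API; it also enforces the proposed "throw if a side is specified twice" rule, with each unset side defaulting to unbounded.

```scala
// Hypothetical prototype of "Option 2": each side of the frame defaults to
// unbounded (None) and may be set at most once.
final case class RowFrame(start: Option[Long] = None, end: Option[Long] = None) {
  def rowsFrom(n: Long): RowFrame = {
    require(start.isEmpty, "frame start specified more than once")
    copy(start = Some(n))
  }
  def rowsTo(n: Long): RowFrame = {
    require(end.isEmpty, "frame end specified more than once")
    copy(end = Some(n))
  }
  // Render one boundary in ANSI SQL terms.
  private def bound(o: Option[Long], unbounded: String): String = o match {
    case None                  => unbounded
    case Some(n) if n == 0     => "CURRENT ROW"
    case Some(n) if n < 0      => s"${-n} PRECEDING"
    case Some(n)               => s"$n FOLLOWING"
  }
  def sql: String =
    s"ROWS BETWEEN ${bound(start, "UNBOUNDED PRECEDING")} AND ${bound(end, "UNBOUNDED FOLLOWING")}"
}
```

For example, `RowFrame().rowsFrom(-3).rowsTo(3).sql` renders as `ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING`, and a bare `RowFrame().sql` as the fully unbounded frame.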
[jira] [Resolved] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type
[ https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-15153. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15431 [https://github.com/apache/spark/pull/15431] > SparkR spark.naiveBayes throws error when label is numeric type > --- > > Key: SPARK-15153 > URL: https://issues.apache.org/jira/browse/SPARK-15153 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.1.0 > > > When the label of dataset is numeric type, SparkR spark.naiveBayes will throw > error. This bug is easy to reproduce: > {code} > t <- as.data.frame(Titanic) > t1 <- t[t$Freq > 0, -5] > t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1) > t2 <- t1[-4] > df <- suppressWarnings(createDataFrame(sqlContext, t2)) > m <- spark.naiveBayes(df, NumericSurvived ~ .) > 16/05/05 03:26:17 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.NaiveBayesWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: > java.lang.ClassCastException: > org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to > org.apache.spark.ml.attribute.NominalAttribute > at > org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66) > at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at io.netty.channel.AbstractChannelHandlerContext.invo > {code} > In RFormula, the response variable type could be string or numeric. If it's > string, RFormula will transform it to a label of DoubleType by StringIndexer > and set corresponding column metadata; otherwise, RFormula will directly use > it as the label when training the model (and assumes that it was numbered from 0, > ..., maxLabelIndex). > When we extract labels at ml.r.NaiveBayesWrapper, we should handle it > according to the type of the response variable (string or numeric). > cc [~mengxr] [~josephkb] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov updated SPARK-17877: Description: The following code demonstrates the issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly")) val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false was: The following code demonstrates the issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly")) val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L > -> "kelly")) > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint > val gg = 
g.connectedComponents > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed // res5: Boolean = false > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov updated SPARK-17877: Description: The following code demonstrates the issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly")) val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false was: The following code demonstrates the issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof" val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents() gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L > -> "kelly")) > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = 
Graph(users, rel) > g.checkpoint > val gg = g.connectedComponents > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed > // res5: Boolean = false > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
Alexander Pivovarov created SPARK-17877: --- Summary: Can not checkpoint connectedComponents resulting graph Key: SPARK-17877 URL: https://issues.apache.org/jira/browse/SPARK-17877 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.0.1, 1.6.2, 1.5.2 Reporter: Alexander Pivovarov Priority: Minor The following code demonstrates an issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof" val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents() gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov updated SPARK-17877: Description: The following code demonstrates the issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof" val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents() gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false was: The following code demonstrates an issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof" val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents() gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates the issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", > "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof" > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, 
"colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint > val gg = g.connectedComponents() > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed > // res5: Boolean = false > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph
[ https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Pivovarov updated SPARK-17877: Description: The following code demonstrates an issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof" val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents() gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false was: The following code demonstrates an issue {code} import org.apache.spark.graphx._ val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof" val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))) sc.setCheckpointDir("/tmp/check") val g = Graph(users, rel) g.checkpoint val gg = g.connectedComponents() gg.checkpoint gg.vertices.collect gg.edges.collect gg.isCheckpointed // res5: Boolean = false {code} I think the last line should return true instead of false > Can not checkpoint connectedComponents resulting graph > -- > > Key: SPARK-17877 > URL: https://issues.apache.org/jira/browse/SPARK-17877 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.5.2, 1.6.2, 2.0.1 >Reporter: Alexander Pivovarov >Priority: Minor > > The following code demonstrates an issue > {code} > import org.apache.spark.graphx._ > val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", > "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof" > val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, > "advisor"), Edge(2L, 5L, 
"colleague"), Edge(5L, 7L, "pi"))) > sc.setCheckpointDir("/tmp/check") > val g = Graph(users, rel) > g.checkpoint > val gg = g.connectedComponents() > gg.checkpoint > gg.vertices.collect > gg.edges.collect > gg.isCheckpointed > // res5: Boolean = false > {code} > I think the last line should return true instead of false -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566349#comment-15566349 ] Jerome Scheuring edited comment on SPARK-12216 at 10/11/16 7:34 PM: _Note that I am entirely new to the process of submitting issues on this system: if this needs to be a new issue, I would appreciate someone letting me know._ A bug very similar to this one is 100% reproducible across multiple machines, running both Windows 8.1 and Windows 10, compiled with Scala 2.11 and running under Spark 2.0.1. It occurs * in Scala, but not Python (have not tried R) * only when reading CSV files (and not, for example, when reading Parquet files) * only when running local, not submitted to a cluster This program will produce the bug (if {{poemData}} is defined per the commented-out section, rather than being read from a CSV file, the bug does not occur): {code} import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types._ object SparkBugDemo { def main(args: Array[String]): Unit = { val poemSchema = StructType( Seq( StructField("label",IntegerType), StructField("line",StringType) ) ) val sparkSession = SparkSession.builder() .appName("Spark Bug Demonstration") .master("local[*]") .getOrCreate() //val poemData = sparkSession.createDataFrame(Seq( // (0, "There's many a strong farmer"), // (0, "Who's heart would break in two"), // (1, "If he could see the townland"), // (1, "That we are riding to;") //)).toDF("label", "line") val poemData = sparkSession.read .option("quote", value="") .schema(poemSchema) .csv(args(0)) println(s"Record count: ${poemData.count()}") } } {code} Assuming that {{args(0)}} contains the path to a file with comma-separated integer/string pairs, as in: {noformat} 0,There's many a strong farmer 0,Who's heart would break in two 1,If he could see the townland 1,That we are riding to; {noformat} was (Author: jerome.scheuring): _Note that I am entirely new to the process of submitting 
issues on this system: if this needs to be a new issue, I would appreciate someone letting me know._ A bug very similar to this one is 100% reproducible across multiple machines, running both Windows 8.1 and Windows 10. It occurs * in Scala, but not Python (have not tried R) * only when reading CSV files (and not, for example, when reading Parquet files) * only when running local, not submitted to a cluster This program will produce the bug (if {{poemData}} is defined per the commented-out section, rather than being read from a CSV file, the bug does not occur): {code} import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types._ object SparkBugDemo { def main(args: Array[String]): Unit = { val poemSchema = StructType( Seq( StructField("label",IntegerType), StructField("line",StringType) ) ) val sparkSession = SparkSession.builder() .appName("Spark Bug Demonstration") .master("local[*]") .getOrCreate() //val poemData = sparkSession.createDataFrame(Seq( // (0, "There's many a strong farmer"), // (0, "Who's heart would break in two"), // (1, "If he could see the townland"), // (1, "That we are riding to;") //)).toDF("label", "line") val poemData = sparkSession.read .option("quote", value="") .schema(poemSchema) .csv(args(0)) println(s"Record count: ${poemData.count()}") } } {code} Assuming that {{args(0)}} contains the path to a file with comma-separated integer/string pairs, as in: {noformat} 0,There's many a strong farmer 0,Who's heart would break in two 1,If he could see the townland 1,That we are riding to; {noformat} > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program 
Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. > 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Fai
[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566349#comment-15566349 ] Jerome Scheuring commented on SPARK-12216: -- _Note that I am entirely new to the process of submitting issues on this system: if this needs to be a new issue, I would appreciate someone letting me know._ A bug very similar to this one is 100% reproducible across multiple machines, running both Windows 8.1 and Windows 10. It occurs * in Scala, but not Python (have not tried R) * only when reading CSV files (and not, for example, when reading Parquet files) * only when running local, not submitted to a cluster This program will produce the bug (if {{poemData}} is defined per the commented-out section, rather than being read from a CSV file, the bug does not occur): {code} import org.apache.spark.sql.SparkSession import org.apache.spark.sql.types._ object SparkBugDemo { def main(args: Array[String]): Unit = { val poemSchema = StructType( Seq( StructField("label",IntegerType), StructField("line",StringType) ) ) val sparkSession = SparkSession.builder() .appName("Spark Bug Demonstration") .master("local[*]") .getOrCreate() //val poemData = sparkSession.createDataFrame(Seq( // (0, "There's many a strong farmer"), // (0, "Who's heart would break in two"), // (1, "If he could see the townland"), // (1, "That we are riding to;") //)).toDF("label", "line") val poemData = sparkSession.read .option("quote", value="") .schema(poemSchema) .csv(args(0)) println(s"Record count: ${poemData.count()}") } } {code} Assuming that {{args(0)}} contains the path to a file with comma-separated integer/string pairs, as in: {noformat} 0,There's many a strong farmer 0,Who's heart would break in two 1,If he could see the townland 1,That we are riding to; {noformat} > Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > 
Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. > 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManag
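For context on the failure above: the shutdown hook walks the temp directory tree and deletes it bottom-up, and on Windows a single file still held open (for example by a reader that was never closed) makes the whole pass fail with {{java.io.IOException: Failed to delete}}. The sketch below is an illustrative Java reconstruction of that kind of recursive delete, not Spark's actual {{Utils.deleteRecursively}} code; the class and method names here are made up for the demonstration.

```java
import java.io.File;
import java.io.IOException;

public class RecursiveDelete {
    // Delete a directory tree bottom-up, throwing if any entry cannot be
    // removed -- roughly the behavior the shutdown hook relies on.
    static void deleteRecursively(File file) throws IOException {
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children != null) {
                for (File child : children) {
                    deleteRecursively(child);
                }
            }
        }
        // On Windows, delete() fails while another process (or an unclosed
        // stream in the same JVM) still holds a handle to the file.
        if (!file.delete() && file.exists()) {
            throw new IOException("Failed to delete: " + file);
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"), "spark-demo-tmp");
        File sub = new File(dir, "sub");
        sub.mkdirs();
        new File(sub, "data.txt").createNewFile();
        deleteRecursively(dir);
        System.out.println("deleted: " + !dir.exists());
    }
}
```

With every handle closed the delete succeeds; keeping a {{FileInputStream}} on {{data.txt}} open while calling it on Windows reproduces the {{IOException}} seen in the log.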
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566299#comment-15566299 ] Sean Owen commented on SPARK-17463: --- No, that change came after, and is part of a different JIRA that addresses another part of the same problem. It is not in 2.0.1 > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputSt
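The {{Caused by: java.util.ConcurrentModificationException}} at {{ArrayList.writeObject}} above arises because the heartbeat thread serializes a list of accumulator updates while a task thread is still appending to it. A minimal Java sketch of the safe pattern (snapshot under a lock, then serialize the copy) is shown below; the class and field names are illustrative, not Spark's actual executor code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class SafeHeartbeat {
    private final List<Long> accumulatorUpdates = new ArrayList<>();

    // Task thread: mutate the list under the lock.
    public synchronized void add(long update) {
        accumulatorUpdates.add(update);
    }

    // Heartbeat thread: copy under the same lock, then serialize the
    // snapshot outside it. Serializing the live list while another thread
    // appends is what raises ConcurrentModificationException in
    // ArrayList.writeObject.
    public byte[] serializeSnapshot() throws IOException {
        List<Long> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(accumulatorUpdates);
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(snapshot);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        SafeHeartbeat h = new SafeHeartbeat();
        h.add(1L);
        h.add(2L);
        System.out.println("serialized bytes: " + h.serializeSnapshot().length);
    }
}
```

The copy is cheap relative to a heartbeat interval, and it bounds the time the lock is held, since the (slow) serialization happens outside the synchronized block.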
[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong
[ https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566284#comment-15566284 ] Sean Owen commented on SPARK-17870: --- OK, I get it: they're doing different things, really. The scikit version is computing the statistic for count-valued features vs a categorical label, and the Spark version is computing it for categorical features vs categorical labels. Although the number of label classes is constant in both cases, the Spark computation also depends on the number of feature classes. Yes, it does need to be changed in Spark. > ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong > > > Key: SPARK-17870 > URL: https://issues.apache.org/jira/browse/SPARK-17870 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng >Priority: Critical > > The method used to compute the ChiSqTestResult in mllib/feature/ChiSqSelector.scala > (line 233) is wrong. > The feature selection method ChiSquareSelector selects features based on > ChiSqTestResult.statistic (the chi-square value): it keeps the features with the > largest chi-square values. But the degrees of freedom (df) of the chi-square values > produced by Statistics.chiSqTest(RDD) differ from feature to feature, and chi-square > values with different df are not directly comparable, so features cannot be selected > on the chi-square value alone. > Because of this wrong way of computing the chi-square value, the feature selection > results are strange. > Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example: > If selectKBest is used: feature 3 is selected. > If selectFpr is used: features 1 and 2 are selected. > This is strange. > I used scikit-learn to test the same data with the same parameters. > With selectKBest: feature 1 is selected. > With selectFpr: features 1 and 2 are selected. > This result makes sense, because in scikit-learn the df of each feature is the same. > I plan to submit a PR for this problem. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
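To see concretely why ranking by the raw statistic fails when degrees of freedom differ, here is a small self-contained Java sketch (not Spark's code) that computes the Pearson chi-square statistic for two hypothetical feature-vs-label contingency tables: feature A is binary (df = 1) while feature B takes four values (df = 3), so their statistics live on different chi-square distributions and comparing them directly, as the reported bug does, is invalid.

```java
public class ChiSquareDemo {
    // Pearson chi-square statistic for an observed contingency table
    // (rows = feature values, columns = class labels), as in a test of
    // independence: sum over cells of (observed - expected)^2 / expected.
    static double chiSq(long[][] observed) {
        int rows = observed.length, cols = observed[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowSum[i] += observed[i][j];
                colSum[j] += observed[i][j];
                total += observed[i][j];
            }
        }
        double stat = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = rowSum[i] * colSum[j] / total;
                double d = observed[i][j] - expected;
                stat += d * d / expected;
            }
        }
        return stat;
    }

    public static void main(String[] args) {
        // Two made-up features against the same binary label.
        long[][] featureA = {{20, 10}, {10, 20}};             // df = (2-1)*(2-1) = 1
        long[][] featureB = {{8, 2}, {7, 3}, {3, 7}, {2, 8}}; // df = (4-1)*(2-1) = 3
        System.out.printf("A: stat=%.3f (df=1)%n", chiSq(featureA));
        System.out.printf("B: stat=%.3f (df=3)%n", chiSq(featureB));
    }
}
```

Converting each statistic to a p-value under its own df (e.g. via a chi-square CDF) puts the features on a common scale, which is presumably the direction a fix would take.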
[jira] [Commented] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type
[ https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566251#comment-15566251 ] Joseph K. Bradley commented on SPARK-15153: --- Note I'm setting the target version for 2.1, not 2.0.x, since the fix requires a public API change in the preceding PR. > SparkR spark.naiveBayes throws error when label is numeric type > --- > > Key: SPARK-15153 > URL: https://issues.apache.org/jira/browse/SPARK-15153 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > When the label of the dataset is of numeric type, SparkR spark.naiveBayes throws an > error. This bug is easy to reproduce: > {code} > t <- as.data.frame(Titanic) > t1 <- t[t$Freq > 0, -5] > t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1) > t2 <- t1[-4] > df <- suppressWarnings(createDataFrame(sqlContext, t2)) > m <- spark.naiveBayes(df, NumericSurvived ~ .) > 16/05/05 03:26:17 ERROR RBackendHandler: fit on > org.apache.spark.ml.r.NaiveBayesWrapper failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: > java.lang.ClassCastException: > org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to > org.apache.spark.ml.attribute.NominalAttribute > at > org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66) > at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at io.netty.channel.AbstractChannelHandlerContext.invo > {code} > In RFormula, the response variable type can be string or numeric. If it's > string, RFormula will transform it to a DoubleType label via StringIndexer > and set the corresponding column metadata; otherwise, RFormula will directly use > it as the label when training the model (and assumes that it is numbered from 0, > ..., maxLabelIndex). > When we extract labels in ml.r.NaiveBayesWrapper, we should handle it > according to the type of the response variable (string or numeric). > cc [~mengxr] [~josephkb]
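The branch described in the paragraph above can be sketched in plain Java: when the label column carries string levels (the nominal metadata a StringIndexer would attach), use those as the class labels; for a raw numeric label, assume the distinct values themselves name the classes. This is an illustrative sketch only; the method and parameter names are hypothetical and not Spark's actual NaiveBayesWrapper code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class LabelExtraction {
    // nominalLevels: string levels from label metadata, or null if the
    // response variable was numeric and no metadata was attached.
    // labelColumn: the raw numeric label values (used only in the null case).
    static List<String> classLabels(String[] nominalLevels, double[] labelColumn) {
        if (nominalLevels != null) {
            // String response: the indexer's levels are the class labels.
            return List.of(nominalLevels);
        }
        // Numeric response: sort the distinct values and name classes by them.
        TreeSet<Double> distinct = new TreeSet<>();
        for (double v : labelColumn) {
            distinct.add(v);
        }
        List<String> labels = new ArrayList<>();
        for (double v : distinct) {
            labels.add(String.valueOf(v));
        }
        return labels;
    }

    public static void main(String[] args) {
        // String label indexed with metadata vs raw numeric label.
        System.out.println(classLabels(new String[]{"No", "Yes"}, null));
        System.out.println(classLabels(null, new double[]{0.0, 1.0, 1.0, 0.0}));
    }
}
```

Either way the caller ends up with one label name per class index, which is what a wrapper needs to map predictions back to user-visible labels.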