[jira] [Assigned] (SPARK-12495) use true as default value for propagateNull in NewInstance
[ https://issues.apache.org/jira/browse/SPARK-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12495: Assignee: Apache Spark > use true as default value for propagateNull in NewInstance > -- > > Key: SPARK-12495 > URL: https://issues.apache.org/jira/browse/SPARK-12495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12495) use true as default value for propagateNull in NewInstance
[ https://issues.apache.org/jira/browse/SPARK-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-12495: Summary: use true as default value for propagateNull in NewInstance (was: use false as default value for propagateNull in NewInstance) > use true as default value for propagateNull in NewInstance > -- > > Key: SPARK-12495 > URL: https://issues.apache.org/jira/browse/SPARK-12495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12495) use false as default value for propagateNull in NewInstance
Wenchen Fan created SPARK-12495: --- Summary: use false as default value for propagateNull in NewInstance Key: SPARK-12495 URL: https://issues.apache.org/jira/browse/SPARK-12495 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069345#comment-15069345 ] Jean-Baptiste Onofré commented on SPARK-12430: -- Let me take a look at that.
> Temporary folders do not get deleted after Task completes causing problems
> with disk space.
> ---
>
> Key: SPARK-12430
> URL: https://issues.apache.org/jira/browse/SPARK-12430
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.1, 1.5.2
> Environment: Ubuntu server
> Reporter: Fede Bar
>
> We are experiencing an issue with automatic /tmp folder deletion after the
> framework completes. Completing an M/R job using Spark 1.5.2 (same behavior as
> Spark 1.5.1) over Mesos does not delete some temporary folders, eventually
> exhausting free disk space on the server.
> Behavior of an M/R job using Spark 1.4.1 over a Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - The following folders are created: */tmp/mesos/slaves/id#*, */tmp/spark-#/*,
> */tmp/spark-#/blockmgr-#*
> - When the task completes, */tmp/spark-#/* is deleted along with its
> */tmp/spark-#/blockmgr-#* sub-folder.
> Behavior of the same M/R job using Spark 1.5.2 over a Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - The following folders are created: */tmp/mesos/mesos/slaves/id** *,
> */tmp/spark-***/ *, {color:red}/tmp/blockmgr-***{color}
> - When the task completes, */tmp/spark-***/ * is deleted but NOT the shuffle
> container folder {color:red}/tmp/blockmgr-***{color}
> Unfortunately, {color:red}/tmp/blockmgr-***{color} can account for several
> GB depending on the job that ran. Over time this fills the disk, with the
> consequences that we all know.
> Running a cleanup shell script would probably work, but it is difficult to
> distinguish folders in use by a running M/R job from stale ones.
> I did notice similar issues opened by other users and marked as "resolved",
> but none seems to exactly match the behavior above.
> I really hope someone has insights on how to fix it.
> Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
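As a stopgap until the leak is fixed, a periodic cleanup could remove blockmgr-* directories that are both old and not owned by any live executor, which addresses the difficulty the report notes in telling in-use folders from stale ones. The sketch below is illustrative plain Python, not part of Spark: the names `find_stale` and `in_use`, and the 24-hour threshold, are assumptions, and obtaining the set of in-use directories would still require querying the cluster manager.

```python
import time

def find_stale(block_dirs, in_use, now=None, max_age_s=24 * 3600):
    """Return blockmgr-* paths not owned by a live executor whose
    mtime is older than max_age_s seconds.

    block_dirs: list of (path, mtime) pairs, e.g. collected via os.scandir("/tmp")
    in_use: set of directory paths currently held by running frameworks
    """
    now = time.time() if now is None else now
    return [path for path, mtime in block_dirs
            if path not in in_use and now - mtime > max_age_s]

# Example: two day-old directories, one still owned by a running job.
dirs = [("/tmp/blockmgr-a", 0.0), ("/tmp/blockmgr-b", 0.0)]
print(find_stale(dirs, in_use={"/tmp/blockmgr-b"}, now=200000.0))  # prints ['/tmp/blockmgr-a']
```

An age threshold alone is not safe for long-running jobs, which is why the sketch also requires the directory to be absent from the in-use set before deleting anything.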
[jira] [Updated] (SPARK-12477) [SQL] Tungsten projection fails for null values in array fields
[ https://issues.apache.org/jira/browse/SPARK-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12477: Assignee: Pierre Borckmans (was: Apache Spark)
> [SQL] Tungsten projection fails for null values in array fields
> ---
>
> Key: SPARK-12477
> URL: https://issues.apache.org/jira/browse/SPARK-12477
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Pierre Borckmans
> Assignee: Pierre Borckmans
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
> Accessing null elements in an array field fails when Tungsten is enabled.
> It works in Spark 1.3.1, and in Spark 1.5 with Tungsten disabled.
> Example:
> {code}
> // Array of String
> case class AS(as: Seq[String])
> val dfAS = sc.parallelize(Seq(AS(Seq("a", null, "b")))).toDF
> dfAS.registerTempTable("T_AS")
> for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(",")) }
> {code}
> With Tungsten disabled:
> {code}
> 0 = [a]
> 1 = [null]
> 2 = [b]
> {code}
> With Tungsten enabled:
> {code}
> 0 = [a]
> 15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
> java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
> at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> {code}
> --
> More examples below.
> The following code works in Spark 1.3.1, and in Spark 1.5 with Tungsten disabled:
> {code}
> // Array of String
> case class AS(as: Seq[String])
> val dfAS = sc.parallelize(Seq(AS(Seq("a", null, "b")))).toDF
> dfAS.registerTempTable("T_AS")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(",")) }
> // Array of Int
> case class AI(ai: Seq[Option[Int]])
> val dfAI = sc.parallelize(Seq(AI(Seq(Some(1), None, Some(2))))).toDF
> dfAI.registerTempTable("T_AI")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select ai[$i] from T_AI").collect.mkString(",")) }
> // Array of struct[Int,String]
> case class B(x: Option[Int], y: String)
> case class A(b: Seq[B])
> val df1 = sc.parallelize(Seq(A(Seq(B(Some(1), "a"), B(Some(2), "b"), B(None, "c"), B(Some(4), null), B(None, null), null)))).toDF
> df1.registerTempTable("T1")
> val df2 = sc.parallelize(Seq(A(Seq(B(Some(1), "a"), B(Some(2), "b"), B(None, "c"), B(Some(4), null), B(None, null), null)), A(null))).toDF
> df2.registerTempTable("T2")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select b[$i].x, b[$i].y from T1").collect.mkString(",")) }
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select b[$i].x, b[$i].y from T2").collect.mkString(",")) }
> // Struct[Int,String]
> case class C(b: B)
> val df3 = sc.parallelize(Seq(C(B(Some(1), "test")), C(null))).toDF
> df3.registerTempTable("T3")
> sqlContext.sql("select b.x, b.y from T3").collect
> {code}
> With Tungsten enabled, it throws a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12477) [SQL] Tungsten projection fails for null values in array fields
[ https://issues.apache.org/jira/browse/SPARK-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12477. - Resolution: Fixed Fix Version/s: 2.0.0, 1.6.1, 1.5.3
> [SQL] Tungsten projection fails for null values in array fields
> ---
>
> Key: SPARK-12477
> URL: https://issues.apache.org/jira/browse/SPARK-12477
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Pierre Borckmans
> Assignee: Apache Spark
> Fix For: 1.5.3, 1.6.1, 2.0.0
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12494) Array out of bound Exception in KMeans Yarn Mode
Anandraj created SPARK-12494: --- Summary: Array out of bound Exception in KMeans Yarn Mode Key: SPARK-12494 URL: https://issues.apache.org/jira/browse/SPARK-12494 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.5.0 Reporter: Anandraj Priority: Blocker
Hi, I am trying to run k-means clustering on word2vec data. I tested the code in local mode with small data, and clustering completes fine. But when I run with the same data in YARN cluster mode, it fails with the error below:
15/12/23 00:49:01 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.ArrayIndexOutOfBoundsException: 0
java.lang.ArrayIndexOutOfBoundsException: 0
at scala.collection.mutable.WrappedArray$ofRef.apply(WrappedArray.scala:126)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
at scala.Array$.tabulate(Array.scala:331)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:377)
at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:249)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:213)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:520)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:531)
at com.tempurer.intelligence.adhocjobs.spark.kMeans$delayedInit$body.apply(kMeans.scala:24)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.tempurer.intelligence.adhocjobs.spark.kMeans$.main(kMeans.scala:9)
at com.tempurer.intelligence.adhocjobs.spark.kMeans.main(kMeans.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
15/12/23 00:49:01 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.ArrayIndexOutOfBoundsException: 0)
In local mode with large data (2,375,849 vectors of size 200), the first sampling stage completes, but the second stage hangs without any error message and with no active execution in progress. I could only see the warning messages below:
15/12/23 01:24:13 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 37) in 29 ms on localhost (4/34)
15/12/23 01:24:14 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:14 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 2 total executors!
15/12/23 01:24:15 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:15 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 3 total executors!
15/12/23 01:24:16 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:16 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 4 total executors!
15/12/23 01:24:17 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:17 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 5 total executors!
15/12/23 01:24:18 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:18 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 6 total executors!
15/12/23 01:24:19 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:19 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 7 total executors!
15/12/23 01:24:20 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:20 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 8 total executors!
15/12/23 01:24:21 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:21 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 9 total executors!
15/12/23 01:24:22 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
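An ArrayIndexOutOfBoundsException at index 0 inside initKMeansParallel is consistent with a zero-length vector (or an empty sample) reaching the initialization step. Until the root cause is confirmed, a defensive pre-check on the input data could rule that out. The sketch below is plain Python for illustration only; `validate_vectors` is a hypothetical helper, not a Spark or MLlib API:

```python
def validate_vectors(vectors):
    """Sanity-check a k-means input set: non-empty, no zero-length
    vectors, and a consistent dimension. Returns the common dimension."""
    if not vectors:
        raise ValueError("empty input: nothing to cluster")
    dim = len(vectors[0])
    if dim == 0:
        raise ValueError("zero-length vector at index 0")
    for i, v in enumerate(vectors):
        if len(v) != dim:
            raise ValueError(f"vector {i} has dimension {len(v)}, expected {dim}")
    return dim

print(validate_vectors([[0.1, 0.2], [0.3, 0.4]]))  # prints 2
```

On an actual RDD, the equivalent check could be something like `data.map(_.size).distinct().collect()`, which should return exactly one nonzero value before calling `KMeans.train`.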
[jira] [Updated] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12492: Component/s: Web UI
> SQL page of Spark-sql is always blank
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
> Issue Type: Bug
> Components: SQL, Web UI
> Reporter: meiyoula
> Attachments: screenshot-1.png
>
> When I run a SQL query in spark-sql, the Execution page of the SQL tab is
> always blank, but it is not blank for the JDBC server.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11
meiyoula created SPARK-12493: Summary: Can't open "details" span of ExecutionsPage in IE11 Key: SPARK-12493 URL: https://issues.apache.org/jira/browse/SPARK-12493 Project: Spark Issue Type: Bug Reporter: meiyoula Attachments: screenshot-1.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11
[ https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-12493: - Component/s: Web UI > Can't open "details" span of ExecutionsPage in IE11 > --- > > Key: SPARK-12493 > URL: https://issues.apache.org/jira/browse/SPARK-12493 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: meiyoula > Attachments: screenshot-1.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11
[ https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-12493: - Attachment: screenshot-1.png > Can't open "details" span of ExecutionsPage in IE11 > --- > > Key: SPARK-12493 > URL: https://issues.apache.org/jira/browse/SPARK-12493 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: meiyoula > Attachments: screenshot-1.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-12492: - Attachment: screenshot-1.png
> SQL page of Spark-sql is always blank
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: meiyoula
> Attachments: screenshot-1.png
>
> When I run a SQL query in spark-sql, the Execution page of the SQL tab is
> always blank, but it is not blank for the JDBC server.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12492) SQL page of Spark-sql is always blank
meiyoula created SPARK-12492: Summary: SQL page of Spark-sql is always blank Key: SPARK-12492 URL: https://issues.apache.org/jira/browse/SPARK-12492 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula When I run a SQL query in spark-sql, the Execution page of the SQL tab is always blank, but it is not blank for the JDBC server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11164) Add InSet pushdown filter back for Parquet
[ https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11164. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10278 [https://github.com/apache/spark/pull/10278] > Add InSet pushdown filter back for Parquet > -- > > Key: SPARK-11164 > URL: https://issues.apache.org/jira/browse/SPARK-11164 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Xiao Li > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11164) Add InSet pushdown filter back for Parquet
[ https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11164: --- Assignee: Xiao Li > Add InSet pushdown filter back for Parquet > -- > > Key: SPARK-11164 > URL: https://issues.apache.org/jira/browse/SPARK-11164 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12385) Push projection into Join
[ https://issues.apache.org/jira/browse/SPARK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069159#comment-15069159 ] Xiao Li commented on SPARK-12385: - After combining them, I believe we will see a performance gain. RDBMS people have also confirmed that this is standard practice in RDBMS compilers.
> Push projection into Join
> -
>
> Key: SPARK-12385
> URL: https://issues.apache.org/jira/browse/SPARK-12385
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Davies Liu
>
> We usually have a Join followed by a projection to prune some columns, but
> Join already has a result projection to produce UnsafeRow; we should combine
> them.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12385) Push projection into Join
[ https://issues.apache.org/jira/browse/SPARK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069133#comment-15069133 ] Xiao Li commented on SPARK-12385: - I will work on it if nobody has started. Thanks!
> Push projection into Join
> -
>
> Key: SPARK-12385
> URL: https://issues.apache.org/jira/browse/SPARK-12385
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Davies Liu
>
> We usually have a Join followed by a projection to prune some columns, but
> Join already has a result projection to produce UnsafeRow; we should combine
> them.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069101#comment-15069101 ] Cody Koeninger commented on SPARK-11045: The direct stream isn't doing any caching unless you specifically ask for it, in which case you can set the storage level.
> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to
> Apache Spark Project
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
> Issue Type: New Feature
> Components: Streaming
> Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the receiver-based low-level
> Kafka consumer from spark-packages
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the
> Apache Spark project.
> This Kafka consumer has been around for more than a year and has matured over
> time. I see many adoptions of this package, and I receive positive feedback
> that this consumer gives better performance and fault-tolerance capabilities.
> The primary intent of this JIRA is to give the community a better alternative
> if they want to use the receiver-based model.
> If this consumer makes it into Spark core, it will definitely see more adoption
> and support from the community, and it will help the many who still prefer the
> receiver-based model of Kafka consumer.
> I understand that the direct stream is the consumer that can give exactly-once
> semantics and uses the Kafka low-level API, which is good. But the direct
> stream has concerns around recovering checkpoints on driver code changes.
> Application developers need to manage their own offsets, which is complex.
> Even if someone does manage their own offsets, it limits the parallelism Spark
> Streaming can achieve: if you want more parallelism and set
> spark.streaming.concurrentJobs higher than 1, you can no longer rely on
> storing offsets externally, as you have no control over which batch will run
> in which sequence.
> Furthermore, the direct stream has higher latency, as it fetches messages from
> Kafka during the RDD action. Also, the number of RDD partitions is limited to
> the number of topic partitions, so unless your Kafka topic has enough
> partitions, you have limited parallelism during RDD processing.
> Due to the concerns above, many people who do not need exactly-once semantics
> still prefer the receiver-based model. Unfortunately, when users fall back to
> the KafkaUtils.createStream approach, which uses the Kafka high-level
> consumer, there are other issues around the reliability of the Kafka
> high-level API: it is buggy and has serious issues around consumer rebalance.
> Hence I do not think it is correct to advise people to use
> KafkaUtils.createStream in production.
> The better option at present is to use the consumer from spark-packages. It
> uses the Kafka low-level consumer API, stores offsets in ZooKeeper, and can
> recover from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the
> blocks, i.e. the volume of messages pulled from Kafka, whereas in Spark rate
> limiting is done by controlling the number of messages. The issue with
> throttling by message count is that if message sizes vary, block sizes will
> also vary. Say your Kafka topic has messages of different sizes, from 10 KB
> to 500 KB: throttling by message count can never give a deterministic block
> size, so there is no guarantee that memory back-pressure can really take
> effect.
> 3. This consumer uses the Kafka low-level API, which gives better performance
> than the KafkaUtils.createStream-based high-level API.
> 4. This consumer can provide an end-to-end no-data-loss channel when enabled
> with the WAL.
> By accepting this low-level Kafka consumer from spark-packages into the
> Apache Spark project, we will give the community better options for Kafka
> connectivity, for both the receiver-less and receiver-based models.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
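Point 2 above (throttling by bytes rather than by message count) can be illustrated with a toy controller. This is a hedged sketch in plain Python, not the package's actual PID implementation; the name `adjust_fetch_bytes`, the gain `kp`, and the clamping bounds are all assumptions made for the example:

```python
def adjust_fetch_bytes(current_limit, observed_bytes, target_bytes,
                       kp=0.5, lo=1024, hi=8 * 1024 * 1024):
    """One proportional step of a byte-based rate controller: move the
    per-batch fetch limit toward the target block size, clamped to
    [lo, hi]. A full PID controller would add integral and derivative
    terms on top of this proportional term."""
    error = target_bytes - observed_bytes
    new_limit = current_limit + kp * error
    return int(min(hi, max(lo, new_limit)))

# A batch came in 200 KB over target, so the limit is reduced:
print(adjust_fetch_bytes(1_000_000, observed_bytes=1_200_000,
                         target_bytes=1_000_000))  # prints 900000
```

Because the controller tracks bytes, the resulting block size stays roughly constant even when individual message sizes range from 10 KB to 500 KB, which is exactly the property a message-count limit cannot guarantee.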
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: Apache Spark
> sparkR collect on GroupedData throws R error "missing value where
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
> Issue Type: Bug
> Components: R, SparkR
> Affects Versions: 1.5.1
> Reporter: Paulo Magalhaes
> Assignee: Apache Spark
>
> sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
> missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because the
> bitwise shift below returns NA:
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function.
> I believe the bitwShiftL function is working as it is supposed to. Therefore,
> this PR fixes it in the SparkR package:
> https://github.com/apache/spark/pull/10436
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: (was: Apache Spark)
> sparkR collect on GroupedData throws R error "missing value where
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
> Issue Type: Bug
> Components: R, SparkR
> Affects Versions: 1.5.1
> Reporter: Paulo Magalhaes
>
> sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
> missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because the
> bitwise shift below returns NA:
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function.
> I believe the bitwShiftL function is working as it is supposed to. Therefore,
> this PR fixes it in the SparkR package:
> https://github.com/apache/spark/pull/10436
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069092#comment-15069092 ] Apache Spark commented on SPARK-12479: -- User 'paulomagalhaes' has created a pull request for this issue: https://github.com/apache/spark/pull/10436
> sparkR collect on GroupedData throws R error "missing value where
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
> Issue Type: Bug
> Components: R, SparkR
> Affects Versions: 1.5.1
> Reporter: Paulo Magalhaes
>
> sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
> missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because the
> bitwise shift below returns NA:
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function.
> I believe the bitwShiftL function is working as it is supposed to. Therefore,
> this PR fixes it in the SparkR package:
> https://github.com/apache/spark/pull/10436
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
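The failure described above is a 32-bit overflow issue: Java-style hash codes rely on silent int wrap-around, while R's bitwShiftL returns NA once a shift leaves the representable 32-bit signed range, which is how bitwShiftL(-1073741824, 1) yields NA. The sketch below shows the intended wrap-around arithmetic in plain Python for illustration only; `java_string_hashcode` is a hypothetical name, not the SparkR function, and the actual fix lives in the PR linked above.

```python
def java_string_hashcode(s):
    """Java-style String.hashCode: h = 31*h + ch, wrapping at 32 bits.
    The explicit mask emulates Java's silent int overflow, which is the
    behavior R's bitwShiftL lacks (it returns NA instead of wrapping)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF   # keep only the low 32 bits
    if h >= 0x80000000:                        # reinterpret as signed 32-bit int
        h -= 0x100000000
    return h

# The shift that returns NA in R wraps cleanly under 32-bit semantics:
wrapped = (-1073741824 << 1) & 0xFFFFFFFF
wrapped = wrapped - 0x100000000 if wrapped >= 0x80000000 else wrapped
print(wrapped)  # prints -2147483648
```

Any R-side fix needs to reproduce this wrap-around explicitly, since R's 32-bit integer type has no overflow semantics to lean on; the Python above only illustrates the arithmetic being emulated.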
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: Apache Spark > sparkR collect on GroupedData throws R error "missing value where > TRUE/FALSE needed" > -- > > Key: SPARK-12479 > URL: https://issues.apache.org/jira/browse/SPARK-12479 > Project: Spark > Issue Type: Bug > Components: R, SparkR >Affects Versions: 1.5.1 >Reporter: Paulo Magalhaes >Assignee: Apache Spark > > sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed" > Spark Version: 1.5.1 > R Version: 3.2.2 > I tracked down the root cause of this exception to a specific key for which > the hashCode could not be calculated. > The following code recreates the problem when run in sparkR: > hashCode <- getFromNamespace("hashCode","SparkR") > hashCode("bc53d3605e8a5b7de1e8e271c2317645") > Error in if (value > .Machine$integer.max) { : > missing value where TRUE/FALSE needed > I went one step further and realised that the problem happens because of the > bitwise shift below returning NA. > bitwShiftL(-1073741824,1) > where bitwShiftL is an R function. > I believe the bitwShiftL function is working as it is supposed to. Therefore, > this PR fixes it in the SparkR package: > https://github.com/apache/spark/pull/10436 > . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069091#comment-15069091 ] Balaji Ramamoorthy commented on SPARK-11045: [~dibbhatt] We are looking for a high-throughput Kafka consumer, and exactly-once is not a priority. I also could not figure out how to set the StorageLevel in KafkaDirect to MEMORY_ONLY. Since the low-level receiver-based consumer supports everything I am looking for, I am curious to know how much performance improvement it provides over KafkaDirect. Did you get a chance to do any benchmarking? > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya > > This JIRA is to track the progress of making the Receiver based Low Level > Kafka Consumer from spark-packages > (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be > contributed back to the Apache Spark Project. > This Kafka consumer has been around for more than a year and has matured over > time. I see many adoptions of this package, and I receive > positive feedback that this consumer gives better performance and fault > tolerance capabilities. > The primary intent of this JIRA is to give the community a better > alternative if they want to use the Receiver-based model. > If this consumer makes it to Spark Core, it will definitely see more adoption > and support from the community and help many who still prefer the Receiver-based > model of Kafka Consumer. > I understand the Direct Stream is the consumer which can give Exactly Once > semantics and uses the Kafka Low Level API, which is good. But Direct Stream > has concerns around recovering the checkpoint on driver code change. Application > developers need to manage their own offsets, which is complex. 
Even if someone > does manage their own offsets, it limits the parallelism Spark Streaming can > achieve. If someone wants more parallelism and wants > spark.streaming.concurrentJobs greater than 1, you can no longer rely on > storing offsets externally, as you have no control over which batch will run in > which sequence. > Furthermore, the Direct Stream has higher latency, as it fetches messages > from Kafka during the RDD action. Also, the number of RDD partitions is limited to > the number of topic partitions. So unless your Kafka topic has enough partitions, > you have limited parallelism during RDD processing. > Due to the above-mentioned concerns, many people who do not want Exactly Once > semantics still prefer the Receiver-based model. Unfortunately, when customers > fall back to the KafkaUtil.CreateStream approach, which uses the Kafka High Level > Consumer, there are other issues around the reliability of the Kafka High Level > API. The Kafka High Level API is buggy and has a serious issue around Consumer > Re-balance. Hence I do not think it is correct to advise people to use > KafkaUtil.CreateStream in production. > The better option presently is to use the Consumer from > spark-packages. It uses the Kafka Low Level Consumer API, stores offsets > in Zookeeper, and can recover from any failure. Below are a few highlights of > this consumer: > 1. It has an inbuilt PID Controller for dynamic rate limiting. > 2. In this consumer, rate limiting is done by controlling the size of blocks, > i.e. the number of bytes pulled from Kafka. Whereas in Spark, > rate limiting is done by controlling the number of messages. The issue with > throttling by number of messages is that if message sizes vary, block size will > also vary. Say your Kafka topic has messages of different sizes, from 10KB to > 500KB. Throttling by number of messages can never give a deterministic > size for your block, hence there is no guarantee that memory back-pressure can > really take effect. > 3. 
This consumer uses the Kafka low level API, which gives better performance > than the KafkaUtils.createStream based High Level API. > 4. This consumer can give an end-to-end no-data-loss channel if enabled with WAL. > By accepting this low level Kafka consumer from spark-packages into the Apache > Spark project, we will give the community better options for Kafka > connectivity, both for the receiver-less and receiver-based models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
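The size-versus-count throttling point in the quoted description can be illustrated with a toy Python sketch (function names and numbers are illustrative only, not taken from the spark-packages consumer):

```python
def block_bytes_by_count(sizes, max_messages):
    # count-based throttling: cap the number of messages per block;
    # the block's byte size then swings with the message sizes
    return sum(sizes[:max_messages])

def block_bytes_by_size(sizes, byte_budget):
    # size-based throttling: pull messages until a byte budget is
    # reached, giving a near-constant block size regardless of how
    # large or small individual messages are
    total, taken = 0, 0
    for s in sizes:
        if taken > 0 and total + s > byte_budget:
            break
        total += s
        taken += 1
    return total

# With messages ranging 10KB..500KB, a 100-message block can weigh
# anywhere from ~1MB to ~50MB, while a 5MB byte budget stays ~5MB.
small, large = [10_000] * 100, [500_000] * 100
print(block_bytes_by_count(small, 100), block_bytes_by_count(large, 100))
print(block_bytes_by_size(small + large, 5_000_000))
```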
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: Apache Spark > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Assignee: Apache Spark >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. 
> print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: (was: Apache Spark) > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. 
> print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069061#comment-15069061 ] Andrew Davidson commented on SPARK-12484: - Hi Xiao, thanks for looking into this. Andy > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069055#comment-15069055 ] Xiao Li commented on SPARK-12483: - Hi, Andy, See this example: {code} DataFrame df = hc.range(0, 100).unionAll(hc.range(0, 100)).select(col("id").as("value")); {code} Good luck, Xiao > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > The following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed: >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList<Features> data = new ArrayList<Features>(2); > data.add(f1); > data.add(f2); > //JavaRDD<Features> rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD<Features> rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the Spark log msgs): > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---+------------+ > | id|    labelStr| > +---+------------+ > |  1|       noise| > |  2|questionable| > +---+------------+ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > |  1| > |  2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10486) Spark intermittently fails to recover from a worker failure (in standalone mode)
[ https://issues.apache.org/jira/browse/SPARK-10486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069054#comment-15069054 ] Liang Chen commented on SPARK-10486: I meet the same problem > Spark intermittently fails to recover from a worker failure (in standalone > mode) > > > Key: SPARK-10486 > URL: https://issues.apache.org/jira/browse/SPARK-10486 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Cheuk Lam >Priority: Critical > > We have run into a problem where some Spark job is aborted after one worker > is killed in a 2-worker standalone cluster. The problem is intermittent, but > we can consistently reproduce it. The problem only appears to happen when we > kill a worker. It doesn't seem to happen when we kill an executor directly. > The program we use to reproduce the problem is some iterative program based > on GraphX, although the nature of the issue doesn't seem to be GraphX > related. This is how we reproduce the problem: > * Set up a standalone cluster of 2 workers; > * Run a Spark application of some iterative program (ours is some based on > GraphX); > * Kill a worker process (and thus the associated executor); > * Intermittently some job will be aborted. > The driver and the executor logs are available, as well as the application > history (event log file). But they are quite large and can't be attached > here. > ~ > After looking into the log files, we think the failure is caused by the > following two things combined: > * The BlockManagerMasterEndpoint in the driver has some stale block info > corresponding to the dead executor after the worker has been killed. The > driver does appear to handle the "RemoveExecutor" message and cleans up all > related block info. But subsequently, and intermittently, it receives some > Akka messages to re-register the dead BlockManager and re-add some of its > blocks. 
As a result, upon GetLocations requests from the remaining executor, > the driver responds with some stale block info, instructing the remaining > executor to fetch blocks from the dead executor. Please see the driver log > excerption below that shows the sequence of events described above. In the > log, there are two executors: 1.2.3.4 was the one which got shut down, while > 5.6.7.8 is the remaining executor. The driver also ran on 5.6.7.8. > * When the remaining executor's BlockManager issues a doGetRemote() call to > fetch the block of data, it fails because the targeted BlockManager which > resided in the dead executor is gone. This failure results in an exception > forwarded to the caller, bypassing the mechanism in the doGetRemote() > function to trigger a re-computation of the block. I don't know whether that > is intentional or not. > Driver log excerption that shows that the driver received messages to > re-register the dead executor after handling the RemoveExecutor message: > 11690 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message > (172.236378 ms) > AkkaMessage(RegisterExecutor(0,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/Executor#670388190]),1.2.3.4:36140,8,Map(stdout > -> > http://1.2.3.4:8081/logPage/?appId=app-20150902203512-&executorId=0&logType=stdout, > stderr -> > http://1.2.3.4:8081/logPage/?appId=app-20150902203512-&executorId=0&logType=stderr)),true) > from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$f] > 11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message > AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, > 52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true) > from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g] > 
11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: > AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, > 52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true) > 11718 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] INFO > BlockManagerMasterEndpoint: Registering block manager 1.2.3.4:52615 with 6.2 > GB RAM, BlockManagerId(0, 1.2.3.4, 52615) > 11719 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message > (1.498313 ms) AkkaMessage(Regi
[jira] [Commented] (SPARK-12478) Dataset fields of product types can't be null
[ https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069044#comment-15069044 ] Cheng Lian commented on SPARK-12478: I'm leaving this ticket open since we also need to backport this to branch-1.6 after the release. > Dataset fields of product types can't be null > - > > Key: SPARK-12478 > URL: https://issues.apache.org/jira/browse/SPARK-12478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > Labels: backport-needed > > Spark shell snippet for reproduction: > {code} > import sqlContext.implicits._ > case class Inner(f: Int) > case class Outer(i: Inner) > Seq(Outer(null)).toDS().toDF().show() > Seq(Outer(null)).toDS().show() > {code} > Expected output should be: > {noformat} > ++ > | i| > ++ > |null| > ++ > ++ > | i| > ++ > |null| > ++ > {noformat} > Actual output: > {noformat} > +--+ > | i| > +--+ > |[null]| > +--+ > java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: > Null value appeared in non-nullable field Inner.f of type scala.Int. If the > schema is inferred from a Scala tuple/case class, or a Java bean, please try > to use scala.Option[_] or other nullable types (e.g. java.lang.Integer > instead of int/scala.Int). 
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, > StructType(StructField(f,IntegerType,false))])) null else newinstance(class > $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class > $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3)) > +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null > else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) >:- isnull(input[0, StructType(StructField(f,IntegerType,false))]) >: +- input[0, StructType(StructField(f,IntegerType,false))] >:- null >+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) > +- assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int) > +- input[0, StructType(StructField(f,IntegerType,false))].f > +- input[0, StructType(StructField(f,IntegerType,false))] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:704) > at org.apache.spark.sql.Dataset.take(Dataset.scala:725) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:240) > at org.apache.spark.sql.Dataset.show(Dataset.scala:230) > at org.apache.spark.sql.Dataset.show(Dataset.scala:193) > at org.apache.spark.sql.Dataset.show(Dataset.scala:201) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:46) > at $iwC$$iwC$$iwC$$iwC.(:48) > at $iwC$$iwC$$iwC.(:50) > at $iwC$$iwC.(:52) > at $iwC.(:54) > at (:56) > at .(:60) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > a
[jira] [Updated] (SPARK-12478) Dataset fields of product types can't be null
[ https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12478: --- Labels: backport-needed (was: ) > Dataset fields of product types can't be null > - > > Key: SPARK-12478 > URL: https://issues.apache.org/jira/browse/SPARK-12478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > Labels: backport-needed > > Spark shell snippet for reproduction: > {code} > import sqlContext.implicits._ > case class Inner(f: Int) > case class Outer(i: Inner) > Seq(Outer(null)).toDS().toDF().show() > Seq(Outer(null)).toDS().show() > {code} > Expected output should be: > {noformat} > ++ > | i| > ++ > |null| > ++ > ++ > | i| > ++ > |null| > ++ > {noformat} > Actual output: > {noformat} > +--+ > | i| > +--+ > |[null]| > +--+ > java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: > Null value appeared in non-nullable field Inner.f of type scala.Int. If the > schema is inferred from a Scala tuple/case class, or a Java bean, please try > to use scala.Option[_] or other nullable types (e.g. java.lang.Integer > instead of int/scala.Int). 
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, > StructType(StructField(f,IntegerType,false))])) null else newinstance(class > $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class > $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3)) > +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null > else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) >:- isnull(input[0, StructType(StructField(f,IntegerType,false))]) >: +- input[0, StructType(StructField(f,IntegerType,false))] >:- null >+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) > +- assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int) > +- input[0, StructType(StructField(f,IntegerType,false))].f > +- input[0, StructType(StructField(f,IntegerType,false))] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:704) > at org.apache.spark.sql.Dataset.take(Dataset.scala:725) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:240) > at org.apache.spark.sql.Dataset.show(Dataset.scala:230) > at org.apache.spark.sql.Dataset.show(Dataset.scala:193) > at org.apache.spark.sql.Dataset.show(Dataset.scala:201) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:46) > at $iwC$$iwC$$iwC$$iwC.(:48) > at $iwC$$iwC$$iwC.(:50) > at $iwC$$iwC.(:52) > at $iwC.(:54) > at (:56) > at .(:60) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045) > at > org.apache.spark.repl.Sp
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069027#comment-15069027 ] Andrew Davidson commented on SPARK-12483: - Hi Xiao, thanks for looking at the issue. Is there a way to change a column name? If you do a select() using a data frame, the column name is really strange; see the attachment for https://issues.apache.org/jira/browse/SPARK-12484 // get column from data frame call df.withColumnName Column newCol = udfDF.col("_c0"); Renaming data frame columns is very common in R. Kind regards, Andy > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069025#comment-15069025 ] Xiao Li commented on SPARK-12484: - Please check the email answer by [~zjffdu] Thanks! > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069020#comment-15069020 ] Xiao Li commented on SPARK-12483: - That means, it is not a bug. Thanks! > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069019#comment-15069019 ] Xiao Li commented on SPARK-12483: - {code}as{code} is used to return a new DataFrame with an alias set. For example, {code} val x = testData2.as("x") val y = testData2.as("y") val join = x.join(y, $"x.a" === $"y.a", "inner").queryExecution.optimizedPlan {code} > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12220) Make Utils.fetchFile support files that contain special characters
[ https://issues.apache.org/jira/browse/SPARK-12220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12220: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Make Utils.fetchFile support files that contain special characters > -- > > Key: SPARK-12220 > URL: https://issues.apache.org/jira/browse/SPARK-12220 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Now if a file name contains some illegal characters, such as " ", > Utils.fetchFile will fail because it doesn't handle this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
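The hazard described above can be sketched with the JDK alone (a minimal illustration, not Spark's actual {{Utils.fetchFile}} code; the {{/tmp/my file.txt}} path is hypothetical): a raw space is illegal in a URI string, while the multi-argument {{java.net.URI}} constructor percent-encodes such characters.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SpecialCharPath {
    // Build a file: URI for a path that may contain characters (such as " ")
    // that are illegal in a raw URI string.
    public static URI toFileUri(String path) throws URISyntaxException {
        // The multi-argument constructor percent-encodes illegal characters.
        return new URI("file", null, path, null);
    }

    public static void main(String[] args) throws Exception {
        String path = "/tmp/my file.txt"; // hypothetical file name with a space
        try {
            new URI("file:" + path); // raw space -> URISyntaxException
        } catch (URISyntaxException e) {
            System.out.println("raw URI rejected: " + e.getReason());
        }
        System.out.println(toFileUri(path).toASCIIString()); // file:/tmp/my%20file.txt
    }
}
```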
[jira] [Updated] (SPARK-11749) Duplicate creating the RDD in file stream when recovering from checkpoint data
[ https://issues.apache.org/jira/browse/SPARK-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-11749: - Affects Version/s: (was: 1.5.0) 1.6.0 > Duplicate creating the RDD in file stream when recovering from checkpoint data > -- > > Key: SPARK-11749 > URL: https://issues.apache.org/jira/browse/SPARK-11749 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0 >Reporter: Jack Hu >Assignee: Jack Hu > Fix For: 1.6.0 > > > I have a case to monitor a HDFS folder, then enrich the incoming data from > the HDFS folder via different table (about 15 reference tables) and send to > different hive table after some operations. > The code is as this: > {code} > val txt = > ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates) > val refTable1 = > ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...) > txt.join(refTable1).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > val refTable2 = > ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...) > txt.join(refTable2).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > /// more refTables in following code > {code} > > The {{batchInterval}} of this application is set to *30 seconds*, the > checkpoint interval is set to *10 minutes*, every batch in {{txt}} has *60 > files* > After recovered from checkpoint data, I can see lots of log to create the RDD > in file stream: rdd in each batch of file stream was been recreated *15 > times*, and it takes about *5 minutes* to create so much file RDD. During > this period, *10K+ broadcast* had been created and almost used all the block > manager space. 
> After some investigation, we found that {{DStream.restoreCheckpointData}} > would be invoked at each output ({{DStream.foreachRDD}} in this case), with no > flag to indicate that this {{DStream}} had already been restored, so the RDD in the file > stream was recreated. > We suggest adding a flag to control the restore process to avoid the duplicated > work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12376) Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method
[ https://issues.apache.org/jira/browse/SPARK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12376: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method > > > Key: SPARK-12376 > URL: https://issues.apache.org/jira/browse/SPARK-12376 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 > Environment: Oracle Java 64-bit (build 1.8.0_66-b17) >Reporter: Evan Chen >Assignee: Evan Chen >Priority: Minor > Fix For: 1.6.0 > > > org.apache.spark.streaming.Java8APISuite.java is failing due to trying to > sort immutable list in assertOrderInvariantEquals method. > Here are the errors: > Tests run: 27, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 5.948 sec > <<< FAILURE! - in org.apache.spark.streaming.Java8APISuite > testMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.217 sec > <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFlatMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.203 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFilter(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.209 sec > <<< ERROR! 
> java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testTransform(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.215 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > Results : > Tests in error: > > Java8APISuite.testFilter:81->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testFlatMap:360->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testMap:63->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testTransform:168->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
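The failure mode above is reproducible with the JDK alone (a minimal sketch, not the suite's actual helper): {{Collections.sort}} writes the sorted result back through {{ListIterator.set}}, which a read-only {{AbstractList}} does not support, so the fix is to copy into a mutable list before sorting.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ImmutableSortDemo {
    // A minimal read-only List: AbstractList's default set() throws
    // UnsupportedOperationException, like the immutable lists the suite hit.
    public static List<Integer> readOnly(int... values) {
        return new AbstractList<Integer>() {
            public Integer get(int i) { return values[i]; }
            public int size() { return values.length; }
        };
    }

    public static void main(String[] args) {
        List<Integer> list = readOnly(3, 1, 2);
        try {
            Collections.sort(list); // sorts via ListIterator.set -> throws
        } catch (UnsupportedOperationException e) {
            System.out.println("sort failed: " + e.getClass().getSimpleName());
        }
        // The fix: copy into a mutable list before sorting.
        List<Integer> copy = new ArrayList<>(list);
        Collections.sort(copy);
        System.out.println(copy); // [1, 2, 3]
    }
}
```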
[jira] [Updated] (SPARK-11749) Duplicate creating the RDD in file stream when recovering from checkpoint data
[ https://issues.apache.org/jira/browse/SPARK-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-11749: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Duplicate creating the RDD in file stream when recovering from checkpoint data > -- > > Key: SPARK-11749 > URL: https://issues.apache.org/jira/browse/SPARK-11749 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0 >Reporter: Jack Hu >Assignee: Jack Hu > Fix For: 1.6.0 > > > I have a case to monitor a HDFS folder, then enrich the incoming data from > the HDFS folder via different table (about 15 reference tables) and send to > different hive table after some operations. > The code is as this: > {code} > val txt = > ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates) > val refTable1 = > ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...) > txt.join(refTable1).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > val refTable2 = > ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...) > txt.join(refTable2).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > /// more refTables in following code > {code} > > The {{batchInterval}} of this application is set to *30 seconds*, the > checkpoint interval is set to *10 minutes*, every batch in {{txt}} has *60 > files* > After recovered from checkpoint data, I can see lots of log to create the RDD > in file stream: rdd in each batch of file stream was been recreated *15 > times*, and it takes about *5 minutes* to create so much file RDD. During > this period, *10K+ broadcast* had been created and almost used all the block > manager space. 
> After some investigation, we found that {{DStream.restoreCheckpointData}} > would be invoked at each output ({{DStream.foreachRDD}} in this case), with no > flag to indicate that this {{DStream}} had already been restored, so the RDD in the file > stream was recreated. > We suggest adding a flag to control the restore process to avoid the duplicated > work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12386: - Target Version/s: 1.6.0 (was: 1.6.1, 2.0.0) > Setting "spark.executor.port" leads to NPE in SparkEnv > -- > > Key: SPARK-12386 > URL: https://issues.apache.org/jira/browse/SPARK-12386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 1.6.0 > > > From the list: > {quote} > when we set spark.executor.port in 1.6, we get thrown a NPE in > SparkEnv$.create(SparkEnv.scala:259). > {quote} > Fix is simple; probably should make it to 1.6.0 since it will affect anyone > using that config options, but I'll leave that to the release manager's > discretion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12386: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Setting "spark.executor.port" leads to NPE in SparkEnv > -- > > Key: SPARK-12386 > URL: https://issues.apache.org/jira/browse/SPARK-12386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 1.6.0 > > > From the list: > {quote} > when we set spark.executor.port in 1.6, we get thrown a NPE in > SparkEnv$.create(SparkEnv.scala:259). > {quote} > Fix is simple; probably should make it to 1.6.0 since it will affect anyone > using that config options, but I'll leave that to the release manager's > discretion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12410) "." and "|" used for String.split directly
[ https://issues.apache.org/jira/browse/SPARK-12410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12410: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > "." and "|" used for String.split directly > -- > > Key: SPARK-12410 > URL: https://issues.apache.org/jira/browse/SPARK-12410 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 1.4.2, 1.5.3, 1.6.0 > > > String.split accepts a regular expression, so we should escape "." and "|". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
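The pitfall fixed above is easy to demonstrate with the JDK alone (a minimal sketch of the hazard, not the patched Spark code): {{String.split}} treats its argument as a regular expression, so an unescaped "." matches every character and every chunk comes back empty.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitEscapeDemo {
    public static void main(String[] args) {
        // "." as a regex matches every character, so each split chunk is empty
        // and trailing empties are dropped -> an empty array.
        System.out.println(Arrays.toString("a.b.c".split(".")));   // []
        // Escaping the metacharacter splits on a literal dot.
        System.out.println(Arrays.toString("a.b.c".split("\\."))); // [a, b, c]
        // Pattern.quote is the general-purpose way to split on a literal.
        System.out.println(Arrays.toString("a|b".split(Pattern.quote("|")))); // [a, b]
    }
}
```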
[jira] [Updated] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-12429: -- Assignee: Shixiong Zhu (was: Apache Spark) > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Accumulators and Broadcasts with Spark Streaming cannot work perfectly when > restarting on driver failures. We need to add some example to guide the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-12429. --- Resolution: Fixed Fix Version/s: 1.6.0 > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Accumulators and Broadcasts with Spark Streaming cannot work perfectly when > restarting on driver failures. We need to add some example to guide the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12061) Persist for Map/filter with Lambda Functions don't always read from Cache
[ https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058419#comment-15058419 ] Xiao Li edited comment on SPARK-12061 at 12/23/15 12:22 AM: The logical plan's cleanArgs do not match when calling sameResult. was (Author: smilegator): Start working on it. Thanks! > Persist for Map/filter with Lambda Functions don't always read from Cache > - > > Key: SPARK-12061 > URL: https://issues.apache.org/jira/browse/SPARK-12061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li > > So far, the existing caching mechanisms do not work on dataset operations > when using map/filter with lambda functions. For example, > {code} > test("persist and then map/filter with lambda functions") { > val f = (i: Int) => i + 1 > val ds = Seq(1, 2, 3).toDS() > val mapped = ds.map(f) > mapped.cache() > val mapped2 = ds.map(f) > assertCached(mapped2) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
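For context, one JVM-level reason function-valued plan arguments are hard to compare (a simplified illustration, not Spark's actual cleanArgs/sameResult logic): two behaviorally identical function objects still only compare equal by reference.

```java
import java.util.function.IntUnaryOperator;

public class LambdaEqualityDemo {
    public static void main(String[] args) {
        // Two lambdas with identical bodies are distinct objects; Object.equals
        // is reference equality, so behavioral equivalence is invisible to it.
        IntUnaryOperator f1 = i -> i + 1;
        IntUnaryOperator f2 = i -> i + 1;
        System.out.println(f1.applyAsInt(1) == f2.applyAsInt(1)); // true
        System.out.println(f1.equals(f2));                        // false
    }
}
```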
[jira] [Resolved] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-12487. --- Resolution: Fixed Fix Version/s: 1.6.0 > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068908#comment-15068908 ] Timothy Hunter commented on SPARK-12247: It seems to me that the calculation of false positives is more relevant for the movie ratings, and that the RMSE right above in the example is already a good example. What do you think? > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12102: - Target Version/s: (was: 1.6.0) > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12102. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10156 [https://github.com/apache/spark/pull/10156] > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12102: - Assignee: Dilip Biswal > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12441) Fixing missingInput in all Logical/Physical operators
[ https://issues.apache.org/jira/browse/SPARK-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12441: Summary: Fixing missingInput in all Logical/Physical operators (was: Fixing missingInput in Generate/MapPartitions/AppendColumns/MapGroups/CoGroup) > Fixing missingInput in all Logical/Physical operators > - > > Key: SPARK-12441 > URL: https://issues.apache.org/jira/browse/SPARK-12441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Xiao Li > > The value of missingInput in > Generate/MapPartitions/AppendColumns/MapGroups/CoGroup is incorrect. > {code} > val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters") > val df2 = > df.explode('letters) { > case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq > } > df2.explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Generate UserDefinedGenerator('letters), true, false, None > +- Project [_1#0 AS number#2,_2#1 AS letters#3] >+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] > == Analyzed Logical Plan == > number: int, letters: string, _1: string > Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] > +- Project [_1#0 AS number#2,_2#1 AS letters#3] >+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] > == Optimized Logical Plan == > Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] > +- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] > == Physical Plan == > !Generate UserDefinedGenerator(letters#3), true, false, > [number#2,letters#3,_1#8] > +- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12491) UDAF result differs in SQL if alias is used
Tristan created SPARK-12491: --- Summary: UDAF result differs in SQL if alias is used Key: SPARK-12491 URL: https://issues.apache.org/jira/browse/SPARK-12491 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Reporter: Tristan Using the GeometricMean UDAF example (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html), I found the following discrepancy in results:

scala> sqlContext.sql("select group_id, gm(id) from simple group by group_id").show()
+--------+---+
|group_id|_c1|
+--------+---+
|       0|0.0|
|       1|0.0|
|       2|0.0|
+--------+---+

scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple group by group_id").show()
+--------+-----------------+
|group_id|    GeometricMean|
+--------+-----------------+
|       0|8.981385496571725|
|       1|7.301716979342118|
|       2|7.706253151292568|
+--------+-----------------+

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
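For reference, the GeometricMean UDAF in the linked blog post computes the n-th root of the product of its inputs. A plain-Scala equivalent (written in the numerically safer exp-of-mean-of-logs form, valid for strictly positive inputs) is useful for checking which of the two results above is correct:

```scala
// Reference implementation: geometric mean as exp(mean(ln x)),
// which equals the n-th root of the product for positive inputs.
def geometricMean(xs: Seq[Double]): Double =
  math.exp(xs.map(math.log).sum / xs.size)

val gm = geometricMean(Seq(2.0, 8.0)) // square root of 16
```

The non-zero aliased results are consistent with this definition, which points at the unaliased query (returning 0.0) as the buggy path.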
[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068827#comment-15068827 ] Ilya Ganelin commented on SPARK-12488: -- Further investigation identifies the issue as stemming from the docTermVector containing zero-vectors (as in no words from the vocabulary present in the document). > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
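Per the comment above, the invalid IDs stem from documents whose term-count vector is all zeros. One defensive workaround (a sketch only, not the ticket's fix; the `(docId, counts)` tuples mimic the shape of an MLlib corpus but use plain arrays so the example stands alone) is to drop empty documents before calling `lda.run`:

```scala
// Corpus as (document id, term-count vector); document 1 contains no
// vocabulary terms at all, the condition that triggered the bad term IDs.
val corpus = Seq(
  (0L, Array(1.0, 0.0, 2.0)),
  (1L, Array(0.0, 0.0, 0.0)), // zero-vector document
  (2L, Array(0.0, 3.0, 0.0))
)

// Keep only documents with at least one non-zero term count.
val nonEmpty = corpus.filter { case (_, counts) => counts.exists(_ != 0.0) }
```

Applying the same filter to the real `docTermVector` RDD before training would avoid feeding zero-vectors to the model.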
[jira] [Assigned] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
[ https://issues.apache.org/jira/browse/SPARK-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12490: Assignee: Apache Spark (was: Josh Rosen) > Don't use Javascript for web UI's paginated table navigation controls > - > > Key: SPARK-12490 > URL: https://issues.apache.org/jira/browse/SPARK-12490 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Josh Rosen >Assignee: Apache Spark > > The web UI's paginated table uses Javascript to implement certain navigation > controls, such as table sorting and the "go to page" form. This is > unnecessary and should be simplified to use plain HTML form controls and > links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
[ https://issues.apache.org/jira/browse/SPARK-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068802#comment-15068802 ] Apache Spark commented on SPARK-12490: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10441 > Don't use Javascript for web UI's paginated table navigation controls > - > > Key: SPARK-12490 > URL: https://issues.apache.org/jira/browse/SPARK-12490 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Josh Rosen >Assignee: Josh Rosen > > The web UI's paginated table uses Javascript to implement certain navigation > controls, such as table sorting and the "go to page" form. This is > unnecessary and should be simplified to use plain HTML form controls and > links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
Josh Rosen created SPARK-12490: -- Summary: Don't use Javascript for web UI's paginated table navigation controls Key: SPARK-12490 URL: https://issues.apache.org/jira/browse/SPARK-12490 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Josh Rosen Assignee: Josh Rosen The web UI's paginated table uses Javascript to implement certain navigation controls, such as table sorting and the "go to page" form. This is unnecessary and should be simplified to use plain HTML form controls and links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12489) Fix minor issues found by Findbugs
[ https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12489: Assignee: Apache Spark > Fix minor issues found by Findbugs > -- > > Key: SPARK-12489 > URL: https://issues.apache.org/jira/browse/SPARK-12489 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > > Just used FindBugs to scan the codes and fixed some real issues: > 1. Close `java.sql.Statement` > 2. Fix incorrect `asInstanceOf`. > 3. Remove unnecessary `synchronized` and `ReentrantLock`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12489) Fix minor issues found by Findbugs
[ https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12489: Assignee: (was: Apache Spark) > Fix minor issues found by Findbugs > -- > > Key: SPARK-12489 > URL: https://issues.apache.org/jira/browse/SPARK-12489 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Reporter: Shixiong Zhu >Priority: Minor > > Just used FindBugs to scan the codes and fixed some real issues: > 1. Close `java.sql.Statement` > 2. Fix incorrect `asInstanceOf`. > 3. Remove unnecessary `synchronized` and `ReentrantLock`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12489) Fix minor issues found by Findbugs
Shixiong Zhu created SPARK-12489: Summary: Fix minor issues found by Findbugs Key: SPARK-12489 URL: https://issues.apache.org/jira/browse/SPARK-12489 Project: Spark Issue Type: Bug Components: MLlib, Spark Core, SQL Reporter: Shixiong Zhu Priority: Minor Just used FindBugs to scan the code and fixed some real issues: 1. Close `java.sql.Statement`. 2. Fix incorrect `asInstanceOf`. 3. Remove unnecessary `synchronized` and `ReentrantLock`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
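Item 1 (closing `java.sql.Statement`) is an instance of the loan pattern. A minimal Scala sketch with a stand-in resource so it runs without a database (the `withCloseable` helper name is illustrative, not Spark's code):

```scala
// Loan pattern: the resource is closed even if the body throws.
def withCloseable[C <: AutoCloseable, T](resource: C)(body: C => T): T =
  try body(resource) finally resource.close()

// Stand-in resource in place of a real java.sql.Statement.
var closed = false
val stmt = new AutoCloseable { def close(): Unit = closed = true }
val result = withCloseable(stmt) { _ => 42 }
```

FindBugs flags statements that can leak on an exception path; routing every use through a helper like this makes the close unconditional.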
[jira] [Comment Edited] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068693#comment-15068693 ] Pete Robbins edited comment on SPARK-12470 at 12/22/15 9:47 PM: I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail, and I'm struggling to see why. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600, and after my change they are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. EDIT: Please ignore. Merged with the latest head, including the changes for SPARK-12388; all tests now pass. was (Author: robbinspg): I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail, and I'm struggling to see why. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600, and after my change they are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. 
> Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
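The unit mismatch is easy to see in isolation: `sizeReduction` counts 8-byte words, so subtracting it directly from a byte count under-reduces by a factor of 8. A standalone sketch (the concrete values are chosen for illustration, not taken from Spark):

```scala
// One word = 8 bytes in UnsafeRow's layout.
val bitset1Words = 1
val bitset2Words = 1
val outputBitsetWords = 1 // the merged bitset fits in one word
val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords // in WORDS

val sizeInBytes = 64
val buggy = sizeInBytes - sizeReduction     // mixes units: subtracts words from bytes
val fixed = sizeInBytes - sizeReduction * 8 // convert words to bytes first
```

The buggy form overstates the joined row's size by 7 bytes per saved word, which matches the inflated pre-shuffle partition sizes reported in the comment above.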
[jira] [Comment Edited] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068761#comment-15068761 ] Ilya Ganelin edited comment on SPARK-12488 at 12/22/15 9:32 PM: [~josephkb] Would love your feedback here. Thanks! was (Author: ilganeli): @jkbradley Would love your feedback here. Thanks! > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068761#comment-15068761 ] Ilya Ganelin commented on SPARK-12488: -- @jkbradley Would love your feedback here. Thanks! > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12471. - Resolution: Fixed Assignee: Nong Li (was: Apache Spark) Fix Version/s: 2.0.0 > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li >Assignee: Nong Li > Fix For: 2.0.0 > > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-12488: - Description: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: The below example generates 10 topics on a data set with a vocabulary of 685. {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted.reverse res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
{code} {code} scala> ldaModel.describeTopics()(0)._1.sorted res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4... {code} was: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: The below example generated 10 topics on a data set with a vocabulary of 685. 
{code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted.reverse res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4... 
{code} > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2
[jira] [Updated] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-12488: - Summary: LDA describeTopics() Generates Invalid Term IDs (was: LDA Describe Topics Generates Invalid Term IDs) > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generated 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12488) LDA Describe Topics Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-12488: - Description: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: The below example generated 10 topics on a data set with a vocabulary of 685. {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted.reverse res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
{code} {code} scala> ldaModel.describeTopics()(0)._1.sorted res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4... {code} was: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} > LDA Describe Topics Generates Invalid Term IDs > -- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generated 10 topics on a data set with a vocabulary of 685. 
> {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -
[jira] [Commented] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068748#comment-15068748 ] Apache Spark commented on SPARK-12487: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/10439 > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12488) LDA Describe Topics Generates Invalid Term IDs
Ilya Ganelin created SPARK-12488: Summary: LDA Describe Topics Generates Invalid Term IDs Key: SPARK-12488 URL: https://issues.apache.org/jira/browse/SPARK-12488 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.5.2 Reporter: Ilya Ganelin When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
[ https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068731#comment-15068731 ] Sean Owen commented on SPARK-12485: --- I must say I don't think that's worth it. Both terms seem equally sound, but dynamic allocation has been widely used to describe the feature externally. Renaming it doesn't seem to add much of anything in comparison. > Rename "dynamic allocation" to "elastic scaling" > > > Key: SPARK-12485 > URL: https://issues.apache.org/jira/browse/SPARK-12485 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > Fewer syllables, sounds more natural. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12487: Assignee: Apache Spark (was: Shixiong Zhu) > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12487: Assignee: Shixiong Zhu (was: Apache Spark) > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12487) Add docs for Kafka message handler
Shixiong Zhu created SPARK-12487: Summary: Add docs for Kafka message handler Key: SPARK-12487 URL: https://issues.apache.org/jira/browse/SPARK-12487 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12486) Executors are not always terminated successfully by the worker.
Nong Li created SPARK-12486: --- Summary: Executors are not always terminated successfully by the worker. Key: SPARK-12486 URL: https://issues.apache.org/jira/browse/SPARK-12486 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Nong Li There are cases where the worker does not kill the executor successfully. One way this can happen is if the executor is in a bad state, fails to heartbeat, and the master tells the worker to kill the executor, but the executor is in such a bad state that the kill request is ignored. This seems to happen when the executor is in heavy GC. The cause is that the Process.destroy() API is not forceful enough. In Java 8, a new API, destroyForcibly(), was added. We should use that where available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
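The graceful-then-forceful escalation the report asks for can be sketched outside Spark with Python's subprocess analogues: terminate() plays the role of Process.destroy() and kill() the role of Java 8's destroyForcibly(). The helper name and grace period below are illustrative assumptions, not Spark's actual worker code:

```python
import subprocess

def stop_process(proc, grace_seconds=5.0):
    """Ask a process to exit; escalate to a hard kill if it does not.

    Returns True if the process exited within the grace period.
    """
    proc.terminate()  # polite request (SIGTERM), analogous to Process.destroy()
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Forceful (SIGKILL), analogous to Java 8's destroyForcibly():
        # delivered even if the target is wedged, e.g. stuck in heavy GC.
        proc.kill()
        proc.wait()
        return False
```

A worker using this pattern instead of a bare destroy() would still reap an executor that ignores the polite request.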
[jira] [Assigned] (SPARK-12414) Remove closure serializer
[ https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-12414: - Assignee: Andrew Or > Remove closure serializer > - > > Key: SPARK-12414 > URL: https://issues.apache.org/jira/browse/SPARK-12414 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > There is a config `spark.closure.serializer` that accepts exactly one value: > the java serializer. This is because there are currently bugs in the Kryo > serializer that make it not a viable candidate. This was uncovered by an > unsuccessful attempt to make it work: SPARK-7708. > My high level point is that the Java serializer has worked well for at least > 6 Spark versions now, and it is an incredibly complicated task to get other > serializers (not just Kryo) to work with Spark's closures. IMO the effort is > not worth it and we should just remove this documentation and all the code > associated with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
Andrew Or created SPARK-12485: - Summary: Rename "dynamic allocation" to "elastic scaling" Key: SPARK-12485 URL: https://issues.apache.org/jira/browse/SPARK-12485 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Fewer syllables, sounds more natural. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
[ https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12485: -- Target Version/s: 2.0.0 > Rename "dynamic allocation" to "elastic scaling" > > > Key: SPARK-12485 > URL: https://issues.apache.org/jira/browse/SPARK-12485 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > Fewer syllables, sounds more natural. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068698#comment-15068698 ] Andrew Davidson commented on SPARK-12484: - related issue: https://issues.apache.org/jira/browse/SPARK-12483 > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent
by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068696#comment-15068696 ] Andrew Davidson commented on SPARK-12484: - What I am really trying to do is rewrite the following Python code in Java. Ideally I would implement this code as an MLlib transformation; however, that does not seem possible at this point in time using the Java API. Kind regards, Andy
def convertMultinomialLabelToBinary(dataFrame):
    newColName = "binomialLabel"
    binomial = udf(lambda labelStr: labelStr if (labelStr == "noise") else "signal", StringType())
    ret = dataFrame.withColumn(newColName, binomial(dataFrame["label"]))
    return ret
> DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068693#comment-15068693 ] Pete Robbins commented on SPARK-12470: -- I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail, and I'm struggling to see why. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600 and after my change are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
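The unit mismatch described in the issue — a reduction counted in 8-byte words subtracted from a size in bytes — comes down to a missing conversion. This is an illustrative Python sketch of the corrected arithmetic, not the actual generated Java; the function and parameter names are assumptions mirroring the quoted Scala:

```python
WORD_SIZE_BYTES = 8  # an UnsafeRow word is 8 bytes

def joined_size_in_bytes(size_in_bytes, bitset1_words, bitset2_words, output_bitset_words):
    """Size of two concatenated rows, with the bitset saving converted to bytes."""
    size_reduction_words = bitset1_words + bitset2_words - output_bitset_words
    # The reported bug subtracts size_reduction_words directly from a byte
    # count, under-counting the saving by a factor of 8.
    return size_in_bytes - size_reduction_words * WORD_SIZE_BYTES

print(joined_size_in_bytes(64, 1, 1, 1))  # 56: merging the two bitsets saves one 8-byte word
```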
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068684#comment-15068684 ] Andrew Davidson commented on SPARK-12484: - you can find some more back ground on the email thread 'should I file a bug? Re: trouble implementing Transformer and calling DataFrame.withColumn()' > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at 
org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12484: Attachment: UDFTest.java Add a unit test file > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12484) DataFrame withColumn() does not work in Java
Andrew Davidson created SPARK-12484: --- Summary: DataFrame withColumn() does not work in Java Key: SPARK-12484 URL: https://issues.apache.org/jira/browse/SPARK-12484 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Environment: mac El Cap. 10.11.2 Java 8 Reporter: Andrew Davidson DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS transformedByUDF#3]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 8:24 PM: -- [~davies], [~jason.white] Are you sure that this patch is OK ? In 1.6.0 if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies], [~jason.white] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
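The lazy alternative the issue sketches — validate each row as it is consumed rather than eagerly materializing the first 10 — can be illustrated with a plain generator. This is a hypothetical helper showing the idea, not PySpark's actual code path:

```python
def validating(rows, schema_ok):
    """Yield rows unchanged, failing fast on the first row that breaks the schema."""
    for row in rows:
        if not schema_ok(row):
            raise ValueError("row does not match schema: %r" % (row,))
        yield row

# Nothing runs until the pipeline is actually consumed, so no eager take(10):
checked = validating(iter([1, 2, 3]), lambda r: isinstance(r, int))
print(list(checked))  # [1, 2, 3]
```

Because the check rides along with consumption, every row is covered without forcing the DAG to execute up front.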
[jira] [Commented] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068670#comment-15068670 ] Apache Spark commented on SPARK-12462: -- User 'xguo27' has created a pull request for this issue: https://github.com/apache/spark/pull/10437 > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: SPARK_12483_unitTest.java added a unit test file > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Comment: was deleted (was: add a unit test file ) > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: (was: SPARK_12483_unitTest.java) > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: SPARK_12483_unitTest.java add a unit test file > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12483) Data Frame as() does not work in Java
Andrew Davidson created SPARK-12483:
---
Summary: Data Frame as() does not work in Java
Key: SPARK-12483
URL: https://issues.apache.org/jira/browse/SPARK-12483
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.2
Environment: Mac El Cap 10.11.2, Java 8
Reporter: Andrew Davidson

The following unit test demonstrates a bug in as(): the column name for aliasDF is not changed.

    @Test
    public void bugDataFrameAsTest() {
        DataFrame df = createData();
        df.printSchema();
        df.show();

        DataFrame aliasDF = df.select("id").as("UUID");
        aliasDF.printSchema();
        aliasDF.show();
    }

    DataFrame createData() {
        Features f1 = new Features(1, category1);
        Features f2 = new Features(2, category2);
        ArrayList<Features> data = new ArrayList<Features>(2);
        data.add(f1);
        data.add(f2);
        //JavaRDD<Features> rdd = javaSparkContext.parallelize(Arrays.asList(f1, f2));
        JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
        DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
        return df;
    }

This is the output I got (without the spark log msgs):

    root
     |-- id: integer (nullable = false)
     |-- labelStr: string (nullable = true)

    +---+------------+
    | id|    labelStr|
    +---+------------+
    |  1|       noise|
    |  2|questionable|
    +---+------------+

    root
     |-- id: integer (nullable = false)

    +---+
    | id|
    +---+
    |  1|
    |  2|
    +---+
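For context on the report above: in the DataFrame API, {{as()}} sets an alias for the whole frame (used to qualify columns in joins), while changing a column name takes {{withColumnRenamed}} or a column-level alias inside {{select}}. A pure-Python sketch of that distinction (the {{MiniFrame}} class and its method names are a hypothetical stand-in, not Spark):

```python
# Toy stand-in illustrating why a frame-level as() leaves column names
# unchanged while a column-level rename does not. MiniFrame is hypothetical,
# not part of Spark.
class MiniFrame:
    def __init__(self, columns, rows, alias=None):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]
        self.alias = alias

    def select(self, *names):
        idx = [self.columns.index(n) for n in names]
        return MiniFrame(names, [tuple(r[i] for i in idx) for r in self.rows])

    def as_(self, name):
        # Frame-level alias: tags the whole frame; column names untouched.
        return MiniFrame(self.columns, self.rows, alias=name)

    def with_column_renamed(self, old, new):
        # Column-level rename: this is what actually changes the schema.
        cols = [new if c == old else c for c in self.columns]
        return MiniFrame(cols, self.rows, alias=self.alias)

df = MiniFrame(["id", "labelStr"], [(1, "noise"), (2, "questionable")])
alias_df = df.select("id").as_("UUID")
renamed_df = df.select("id").with_column_renamed("id", "UUID")
print(alias_df.columns)    # still ['id'] -- matches the behavior in the report
print(renamed_df.columns)  # ['UUID']
```

The same split exists in Spark's Java API: {{df.select("id").as("UUID")}} aliases the frame, whereas {{df.select("id").withColumnRenamed("id", "UUID")}} renames the column.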
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 7:45 PM: -- [~davies], [~jason.white] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 7:44 PM: -- [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} from pyspark.sql.types import * schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} from pyspark.sql.types import * schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068627#comment-15068627 ] Maciej Bryński commented on SPARK-11437:

[~davies] Are you sure that this patch is OK? Right now, if I'm creating a DataFrame from an RDD of Rows, there is no schema validation, so we can create a schema with the wrong types.

{code}
from pyspark.sql.types import *
schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect()
[Row(id=1, name=None)]
{code}

Even better: columns can change places.

{code}
from pyspark.sql.types import *
schema = StructType([StructField("name", IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect()
[Row(name=1)]
{code}

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Jason White
> Assignee: Jason White
> Fix For: 1.6.0
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema.
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be
> executed non-lazily. If necessary, I believe this verification should be done
> lazily on all rows. However, since the caller is providing a schema to
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325
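The lazy, all-rows verification suggested in the issue description can be sketched in plain Python (no Spark here; {{verify_rows}} and the two-field schema are hypothetical names, not Spark internals): a generator validates each row only as it is consumed, so nothing executes up front, and a mismatch like a string against an integer field fails loudly instead of silently becoming {{None}}.

```python
# Sketch of lazy schema verification: rows are checked only as they are
# consumed, so building the pipeline stays lazy. All names here are
# hypothetical, not Spark's internal API.
def verify_rows(rows, schema):
    """schema: list of (field_name, python_type) pairs."""
    for i, row in enumerate(rows):
        for name, expected in schema:
            value = row[name]
            if value is not None and not isinstance(value, expected):
                raise TypeError(
                    "row %d: field %r expected %s, got %r"
                    % (i, name, expected.__name__, value))
        yield row

schema = [("id", int), ("name", int)]   # deliberately wrong: name holds a str
rows = [{"id": 1, "name": "abc"}]

checked = verify_rows(rows, schema)     # nothing validated yet (lazy)
try:
    list(checked)                       # consumption triggers the check
except TypeError as e:
    print("validation failed:", e)
```

This keeps the DAG lazy while still rejecting rows that contradict the caller-provided schema, rather than coercing them to {{None}} as shown in the comment above.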
[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12462: Assignee: Apache Spark > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068572#comment-15068572 ] Kyle Sutton edited comment on SPARK-6476 at 12/22/15 7:14 PM: -- The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ The problem is that, while the file server is listening on all ports on the file server host, the _Spark_ service attempts to call back to the default IP of the host, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} * A connection is made from the driver host to the _Spark_ service ** {{spark.driver.host}} is set to the IP of the driver host on the LAN {{172.30.0.2}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the driver host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect {code:title=Driver|borderStyle=solid} SparkConf conf = new SparkConf() .setMaster("spark://172.30.0.3:7077") .setAppName("TestApp") .set("spark.driver.host", "172.30.0.2") .set("spark.driver.port", "50003") .set("spark.fileserver.port", 
"50005"); JavaSparkContext sc = new JavaSparkContext(conf); sc.addJar("target/code.jar"); {code} {code:title=Stacktrace|borderStyle=solid} 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.
[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12462: Assignee: (was: Apache Spark) > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12462: Assignee: Apache Spark > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068572#comment-15068572 ] Kyle Sutton commented on SPARK-6476:

The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_.

The problem is that, while the file server is listening on all interfaces on the file server host, the _Spark_ service attempts to call back to the default IP of the host, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN {{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server
** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect

{code:title=Driver|borderStyle=solid}
SparkConf conf = new SparkConf()
    .setMaster("spark://172.30.0.3:7077")
    .setAppName("TestApp")
    .set("spark.driver.host", "172.30.0.2")
    .set("spark.driver.port", "50003")
    .set("spark.fileserver.port", "50005");
JavaSparkContext sc =
new JavaSparkContext(conf); sc.addJar("target/code.jar"); {code} {code:title=Stacktrace|borderStyle=solid} 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.
[jira] [Commented] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068568#comment-15068568 ] Kyle Sutton commented on SPARK-12482:

Thanks! I did. I think he's saying that the fileserver is listening on all interfaces, but if the Spark service can't see the IP given to it by the Spark driver, the listening interfaces are immaterial.

> Spark fileserver not started on same IP as configured in spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.1, 1.5.2
> Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_.
> The problem is that, while the file server is listening on all interfaces on the
> file server host, the _Spark_ service attempts to call back to the default
> IP of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} > * A connection is made from the driver host to the _Spark_ service > ** {{spark.driver.host}} is set to the IP of the driver host on the LAN > {{172.30.0.2}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the driver host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the driver host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A {{netstat}} on the _Spark_ service host will show the connection to the > file server host as being in {{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > s
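The distinction driving this bug is bind address vs. advertised address: a server can bind {{0.0.0.0}} (all interfaces) and still hand peers an unroutable default IP. A pure-Python sketch of the fix direction, reusing {{spark.driver.host}} for the advertised callback URL when it is set ({{callback_url}} is a hypothetical helper, not Spark code; the default fileserver port of 50005 is taken from the report above):

```python
# Bind vs. advertise: binding to 0.0.0.0 accepts connections on every
# interface, but peers must be told a routable address. This sketch derives
# the advertised file-server URL from spark.driver.host when it is set,
# falling back to the host's default IP only as a last resort.
# callback_url is a hypothetical helper, not Spark's actual API.

def callback_url(conf, default_ip, path="/jars/code.jar"):
    host = conf.get("spark.driver.host") or default_ip
    port = conf.get("spark.fileserver.port", "50005")
    return "http://%s:%s%s" % (host, port, path)

conf = {
    "spark.driver.host": "172.30.0.2",
    "spark.driver.port": "50003",
    "spark.fileserver.port": "50005",
}
# Advertises the routable LAN IP instead of the default 192.168.1.2:
print(callback_url(conf, default_ip="192.168.1.2"))
```

With the setup from the report, this would advertise {{http://172.30.0.2:50005/jars/code.jar}}, which the Spark service on the {{172.30.0.0/24}} network can reach, instead of the unroutable {{192.168.1.2}} default.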