[jira] [Assigned] (SPARK-12495) use true as default value for propagateNull in NewInstance
[ https://issues.apache.org/jira/browse/SPARK-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12495: Assignee: Apache Spark > use true as default value for propagateNull in NewInstance > -- > > Key: SPARK-12495 > URL: https://issues.apache.org/jira/browse/SPARK-12495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12495) use true as default value for propagateNull in NewInstance
[ https://issues.apache.org/jira/browse/SPARK-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-12495: Summary: use true as default value for propagateNull in NewInstance (was: use false as default value for propagateNull in NewInstance) > use true as default value for propagateNull in NewInstance > -- > > Key: SPARK-12495 > URL: https://issues.apache.org/jira/browse/SPARK-12495 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12495) use false as default value for propagateNull in NewInstance
Wenchen Fan created SPARK-12495: --- Summary: use false as default value for propagateNull in NewInstance Key: SPARK-12495 URL: https://issues.apache.org/jira/browse/SPARK-12495 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069345#comment-15069345 ] Jean-Baptiste Onofré commented on SPARK-12430: -- Let me take a look at that.
> Temporary folders do not get deleted after Task completes causing problems
> with disk space.
> ---
>
> Key: SPARK-12430
> URL: https://issues.apache.org/jira/browse/SPARK-12430
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.1, 1.5.2
> Environment: Ubuntu server
> Reporter: Fede Bar
>
> We are experiencing an issue with automatic /tmp folder deletion after the
> framework completes. Completing an M/R job using Spark 1.5.2 (same behavior as
> Spark 1.5.1) over Mesos does not delete some temporary folders, eventually
> exhausting free disk space on the server.
> Behavior of an M/R job using Spark 1.4.1 over a Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - The following folders are created: */tmp/mesos/slaves/id#*, */tmp/spark-#/*,
> */tmp/spark-#/blockmgr-#*
> - When the task completes, */tmp/spark-#/* is deleted along with its
> */tmp/spark-#/blockmgr-#* sub-folder.
> Behavior of the same M/R job using Spark 1.5.2 over a Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - The following folders are created: */tmp/mesos/mesos/slaves/id** *,
> */tmp/spark-***/ *, {color:red}/tmp/blockmgr-***{color}
> - When the task completes, */tmp/spark-***/ * is deleted but NOT the shuffle
> container folder {color:red}/tmp/blockmgr-***{color}
> Unfortunately, {color:red}/tmp/blockmgr-***{color} can account for several
> GB depending on the job that ran. Over time this fills the disk, with the
> consequences that we all know.
> Running a cleanup shell script would probably work, but it is difficult to
> distinguish folders in use by a running M/R job from stale ones.
> I did notice similar issues opened by other users and marked as "resolved",
> but none seems to exactly match the behavior above.
> I really hope someone has insights on how to fix it.
> Thank you very much! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
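As a stopgap until the leak is fixed, a periodic cleanup could remove blockmgr-* directories that are both old and not owned by any live executor, which addresses the difficulty the report notes in telling in-use folders from stale ones. The sketch below is illustrative plain Python, not part of Spark: the names `find_stale` and `in_use`, and the 24-hour threshold, are assumptions, and obtaining the set of in-use directories would still require querying the cluster manager.

```python
import time

def find_stale(block_dirs, in_use, now=None, max_age_s=24 * 3600):
    """Return blockmgr-* paths not owned by a live executor whose
    mtime is older than max_age_s seconds.

    block_dirs: list of (path, mtime) pairs, e.g. collected via os.scandir("/tmp")
    in_use: set of directory paths currently held by running frameworks
    """
    now = time.time() if now is None else now
    return [path for path, mtime in block_dirs
            if path not in in_use and now - mtime > max_age_s]

# Example: two day-old directories, one still owned by a running job.
dirs = [("/tmp/blockmgr-a", 0.0), ("/tmp/blockmgr-b", 0.0)]
print(find_stale(dirs, in_use={"/tmp/blockmgr-b"}, now=200000.0))  # prints ['/tmp/blockmgr-a']
```

An age threshold alone is not safe for long-running jobs, which is why the sketch also requires the directory to be absent from the in-use set before deleting anything.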
[jira] [Updated] (SPARK-12477) [SQL] Tungsten projection fails for null values in array fields
[ https://issues.apache.org/jira/browse/SPARK-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12477: Assignee: Pierre Borckmans (was: Apache Spark)
> [SQL] Tungsten projection fails for null values in array fields
> ---
>
> Key: SPARK-12477
> URL: https://issues.apache.org/jira/browse/SPARK-12477
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Pierre Borckmans
> Assignee: Pierre Borckmans
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
> Accessing null elements in an array field fails when Tungsten is enabled.
> It works in Spark 1.3.1, and in Spark 1.5 with Tungsten disabled.
> Example:
> {code}
> // Array of String
> case class AS(as: Seq[String])
> val dfAS = sc.parallelize(Seq(AS(Seq("a", null, "b")))).toDF
> dfAS.registerTempTable("T_AS")
> for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(",")) }
> {code}
> With Tungsten disabled:
> {code}
> 0 = [a]
> 1 = [null]
> 2 = [b]
> {code}
> With Tungsten enabled:
> {code}
> 0 = [a]
> 15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15)
> java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
> at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
> at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> {code}
> --
> More examples below.
> The following code works in Spark 1.3.1, and in Spark 1.5 with Tungsten disabled:
> {code}
> // Array of String
> case class AS(as: Seq[String])
> val dfAS = sc.parallelize(Seq(AS(Seq("a", null, "b")))).toDF
> dfAS.registerTempTable("T_AS")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(",")) }
> // Array of Int
> case class AI(ai: Seq[Option[Int]])
> val dfAI = sc.parallelize(Seq(AI(Seq(Some(1), None, Some(2))))).toDF
> dfAI.registerTempTable("T_AI")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select ai[$i] from T_AI").collect.mkString(",")) }
> // Array of struct[Int,String]
> case class B(x: Option[Int], y: String)
> case class A(b: Seq[B])
> val df1 = sc.parallelize(Seq(A(Seq(B(Some(1), "a"), B(Some(2), "b"), B(None, "c"), B(Some(4), null), B(None, null), null)))).toDF
> df1.registerTempTable("T1")
> val df2 = sc.parallelize(Seq(A(Seq(B(Some(1), "a"), B(Some(2), "b"), B(None, "c"), B(Some(4), null), B(None, null), null)), A(null))).toDF
> df2.registerTempTable("T2")
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select b[$i].x, b[$i].y from T1").collect.mkString(",")) }
> for (i <- 0 to 10) { println(i + " = " + sqlContext.sql(s"select b[$i].x, b[$i].y from T2").collect.mkString(",")) }
> // Struct[Int,String]
> case class C(b: B)
> val df3 = sc.parallelize(Seq(C(B(Some(1), "test")), C(null))).toDF
> df3.registerTempTable("T3")
> sqlContext.sql("select b.x, b.y from T3").collect
> {code}
> With Tungsten enabled, it throws a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12477) [SQL] Tungsten projection fails for null values in array fields
[ https://issues.apache.org/jira/browse/SPARK-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12477. - Resolution: Fixed Fix Version/s: 2.0.0, 1.6.1, 1.5.3
> [SQL] Tungsten projection fails for null values in array fields
> ---
>
> Key: SPARK-12477
> URL: https://issues.apache.org/jira/browse/SPARK-12477
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2, 1.6.0
> Reporter: Pierre Borckmans
> Assignee: Apache Spark
> Fix For: 1.5.3, 1.6.1, 2.0.0
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12494) Array out of bound Exception in KMeans Yarn Mode
Anandraj created SPARK-12494: --- Summary: Array out of bound Exception in KMeans Yarn Mode Key: SPARK-12494 URL: https://issues.apache.org/jira/browse/SPARK-12494 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.5.0 Reporter: Anandraj Priority: Blocker
Hi, I am trying to run k-means clustering on word2vec data. I tested the code in local mode with small data, and clustering completes fine. But when I run with the same data in YARN cluster mode, it fails with the error below:
15/12/23 00:49:01 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.ArrayIndexOutOfBoundsException: 0
java.lang.ArrayIndexOutOfBoundsException: 0
at scala.collection.mutable.WrappedArray$ofRef.apply(WrappedArray.scala:126)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:377)
at scala.Array$.tabulate(Array.scala:331)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:377)
at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:249)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:213)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:520)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:531)
at com.tempurer.intelligence.adhocjobs.spark.kMeans$delayedInit$body.apply(kMeans.scala:24)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.tempurer.intelligence.adhocjobs.spark.kMeans$.main(kMeans.scala:9)
at com.tempurer.intelligence.adhocjobs.spark.kMeans.main(kMeans.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
15/12/23 00:49:01 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.ArrayIndexOutOfBoundsException: 0)
In local mode with large data (2,375,849 vectors of size 200), the first sampling stage completes, but the second stage hangs without any error message and with no active execution in progress. I could only see the warning messages below:
15/12/23 01:24:13 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 37) in 29 ms on localhost (4/34)
15/12/23 01:24:14 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:14 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 2 total executors!
15/12/23 01:24:15 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:15 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 3 total executors!
15/12/23 01:24:16 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:16 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 4 total executors!
15/12/23 01:24:17 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:17 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 5 total executors!
15/12/23 01:24:18 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:18 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 6 total executors!
15/12/23 01:24:19 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:19 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 7 total executors!
15/12/23 01:24:20 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:20 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 8 total executors!
15/12/23 01:24:21 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
15/12/23 01:24:21 WARN ExecutorAllocationManager: Unable to reach the cluster manager to request 9 total executors!
15/12/23 01:24:22 WARN SparkContext: Requesting executors is only supported in coarse-grained mode
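An ArrayIndexOutOfBoundsException at index 0 inside initKMeansParallel is consistent with a zero-length vector (or an empty sample) reaching the initialization step. Until the root cause is confirmed, a defensive pre-check on the input data could rule that out. The sketch below is plain Python for illustration only; `validate_vectors` is a hypothetical helper, not a Spark or MLlib API:

```python
def validate_vectors(vectors):
    """Sanity-check a k-means input set: non-empty, no zero-length
    vectors, and a consistent dimension. Returns the common dimension."""
    if not vectors:
        raise ValueError("empty input: nothing to cluster")
    dim = len(vectors[0])
    if dim == 0:
        raise ValueError("zero-length vector at index 0")
    for i, v in enumerate(vectors):
        if len(v) != dim:
            raise ValueError(f"vector {i} has dimension {len(v)}, expected {dim}")
    return dim

print(validate_vectors([[0.1, 0.2], [0.3, 0.4]]))  # prints 2
```

On an actual RDD, the equivalent check could be something like `data.map(_.size).distinct().collect()`, which should return exactly one nonzero value before calling `KMeans.train`.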
[jira] [Updated] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12492: Component/s: Web UI
> SQL page of Spark-sql is always blank
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
> Issue Type: Bug
> Components: SQL, Web UI
> Reporter: meiyoula
> Attachments: screenshot-1.png
>
> When I run a SQL query in spark-sql, the Execution page of the SQL tab is
> always blank, but it is not blank for the JDBC server.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11
meiyoula created SPARK-12493: Summary: Can't open "details" span of ExecutionsPage in IE11 Key: SPARK-12493 URL: https://issues.apache.org/jira/browse/SPARK-12493 Project: Spark Issue Type: Bug Reporter: meiyoula Attachments: screenshot-1.png -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11
[ https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-12493: - Component/s: Web UI > Can't open "details" span of ExecutionsPage in IE11 > --- > > Key: SPARK-12493 > URL: https://issues.apache.org/jira/browse/SPARK-12493 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: meiyoula > Attachments: screenshot-1.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12493) Can't open "details" span of ExecutionsPage in IE11
[ https://issues.apache.org/jira/browse/SPARK-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-12493: - Attachment: screenshot-1.png > Can't open "details" span of ExecutionsPage in IE11 > --- > > Key: SPARK-12493 > URL: https://issues.apache.org/jira/browse/SPARK-12493 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: meiyoula > Attachments: screenshot-1.png > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-12492: - Attachment: screenshot-1.png
> SQL page of Spark-sql is always blank
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: meiyoula
> Attachments: screenshot-1.png
>
> When I run a SQL query in spark-sql, the Execution page of the SQL tab is
> always blank, but it is not blank for the JDBC server.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12492) SQL page of Spark-sql is always blank
meiyoula created SPARK-12492: Summary: SQL page of Spark-sql is always blank Key: SPARK-12492 URL: https://issues.apache.org/jira/browse/SPARK-12492 Project: Spark Issue Type: Bug Components: SQL Reporter: meiyoula When I run a SQL query in spark-sql, the Execution page of the SQL tab is always blank, but it is not blank for the JDBC server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11164) Add InSet pushdown filter back for Parquet
[ https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11164. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10278 [https://github.com/apache/spark/pull/10278] > Add InSet pushdown filter back for Parquet > -- > > Key: SPARK-11164 > URL: https://issues.apache.org/jira/browse/SPARK-11164 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Xiao Li > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11164) Add InSet pushdown filter back for Parquet
[ https://issues.apache.org/jira/browse/SPARK-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11164: --- Assignee: Xiao Li > Add InSet pushdown filter back for Parquet > -- > > Key: SPARK-11164 > URL: https://issues.apache.org/jira/browse/SPARK-11164 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12385) Push projection into Join
[ https://issues.apache.org/jira/browse/SPARK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069159#comment-15069159 ] Xiao Li commented on SPARK-12385: - After combining them, I believe we will see a performance gain. RDBMS people have also confirmed that this is standard practice in RDBMS compilers.
> Push projection into Join
> -
>
> Key: SPARK-12385
> URL: https://issues.apache.org/jira/browse/SPARK-12385
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Davies Liu
>
> We usually have a Join followed by a projection to prune some columns, but
> Join already has a result projection to produce UnsafeRow; we should combine
> them.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12385) Push projection into Join
[ https://issues.apache.org/jira/browse/SPARK-12385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069133#comment-15069133 ] Xiao Li commented on SPARK-12385: - I will work on it if nobody has started. Thanks!
> Push projection into Join
> -
>
> Key: SPARK-12385
> URL: https://issues.apache.org/jira/browse/SPARK-12385
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Davies Liu
>
> We usually have a Join followed by a projection to prune some columns, but
> Join already has a result projection to produce UnsafeRow; we should combine
> them.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069101#comment-15069101 ] Cody Koeninger commented on SPARK-11045: The direct stream isn't doing any caching unless you specifically ask for it, in which case you can set the storage level.
> Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to
> Apache Spark Project
>
> Key: SPARK-11045
> URL: https://issues.apache.org/jira/browse/SPARK-11045
> Project: Spark
> Issue Type: New Feature
> Components: Streaming
> Reporter: Dibyendu Bhattacharya
>
> This JIRA is to track the progress of contributing the receiver-based low-level
> Kafka consumer from spark-packages
> (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the
> Apache Spark project.
> This Kafka consumer has been around for more than a year and has matured over
> time. I see many adoptions of this package, and I receive positive feedback
> that this consumer gives better performance and fault-tolerance capabilities.
> The primary intent of this JIRA is to give the community a better alternative
> if they want to use the receiver-based model.
> If this consumer makes it into Spark core, it will definitely see more adoption
> and support from the community, and it will help the many who still prefer the
> receiver-based model of Kafka consumer.
> I understand that the direct stream is the consumer that can give exactly-once
> semantics and uses the Kafka low-level API, which is good. But the direct
> stream has concerns around recovering checkpoints on driver code changes.
> Application developers need to manage their own offsets, which is complex.
> Even if someone does manage their own offsets, it limits the parallelism Spark
> Streaming can achieve: if you want more parallelism and set
> spark.streaming.concurrentJobs higher than 1, you can no longer rely on
> storing offsets externally, as you have no control over which batch will run
> in which sequence.
> Furthermore, the direct stream has higher latency, as it fetches messages from
> Kafka during the RDD action. Also, the number of RDD partitions is limited to
> the number of topic partitions, so unless your Kafka topic has enough
> partitions, you have limited parallelism during RDD processing.
> Due to the concerns above, many people who do not need exactly-once semantics
> still prefer the receiver-based model. Unfortunately, when users fall back to
> the KafkaUtils.createStream approach, which uses the Kafka high-level
> consumer, there are other issues around the reliability of the Kafka
> high-level API: it is buggy and has serious issues around consumer rebalance.
> Hence I do not think it is correct to advise people to use
> KafkaUtils.createStream in production.
> The better option at present is to use the consumer from spark-packages. It
> uses the Kafka low-level consumer API, stores offsets in ZooKeeper, and can
> recover from any failure. Below are a few highlights of this consumer:
> 1. It has an inbuilt PID controller for dynamic rate limiting.
> 2. In this consumer, rate limiting is done by controlling the size of the
> blocks, i.e. the volume of messages pulled from Kafka, whereas in Spark rate
> limiting is done by controlling the number of messages. The issue with
> throttling by message count is that if message sizes vary, block sizes will
> also vary. Say your Kafka topic has messages of different sizes, from 10 KB
> to 500 KB: throttling by message count can never give a deterministic block
> size, so there is no guarantee that memory back-pressure can really take
> effect.
> 3. This consumer uses the Kafka low-level API, which gives better performance
> than the KafkaUtils.createStream-based high-level API.
> 4. This consumer can provide an end-to-end no-data-loss channel when enabled
> with the WAL.
> By accepting this low-level Kafka consumer from spark-packages into the
> Apache Spark project, we will give the community better options for Kafka
> connectivity, for both the receiver-less and receiver-based models.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
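Point 2 above (throttling by bytes rather than by message count) can be illustrated with a toy controller. This is a hedged sketch in plain Python, not the package's actual PID implementation; the name `adjust_fetch_bytes`, the gain `kp`, and the clamping bounds are all assumptions made for the example:

```python
def adjust_fetch_bytes(current_limit, observed_bytes, target_bytes,
                       kp=0.5, lo=1024, hi=8 * 1024 * 1024):
    """One proportional step of a byte-based rate controller: move the
    per-batch fetch limit toward the target block size, clamped to
    [lo, hi]. A full PID controller would add integral and derivative
    terms on top of this proportional term."""
    error = target_bytes - observed_bytes
    new_limit = current_limit + kp * error
    return int(min(hi, max(lo, new_limit)))

# A batch came in 200 KB over target, so the limit is reduced:
print(adjust_fetch_bytes(1_000_000, observed_bytes=1_200_000,
                         target_bytes=1_000_000))  # prints 900000
```

Because the controller tracks bytes, the resulting block size stays roughly constant even when individual message sizes range from 10 KB to 500 KB, which is exactly the property a message-count limit cannot guarantee.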
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: Apache Spark
> sparkR collect on GroupedData throws R error "missing value where
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
> Issue Type: Bug
> Components: R, SparkR
> Affects Versions: 1.5.1
> Reporter: Paulo Magalhaes
> Assignee: Apache Spark
>
> sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
> missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because the
> bitwise shift below returns NA:
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function.
> I believe the bitwShiftL function is working as it is supposed to. Therefore,
> this PR fixes it in the SparkR package:
> https://github.com/apache/spark/pull/10436
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: (was: Apache Spark)
> sparkR collect on GroupedData throws R error "missing value where
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
> Issue Type: Bug
> Components: R, SparkR
> Affects Versions: 1.5.1
> Reporter: Paulo Magalhaes
>
> sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
> missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because the
> bitwise shift below returns NA:
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function.
> I believe the bitwShiftL function is working as it is supposed to. Therefore,
> this PR fixes it in the SparkR package:
> https://github.com/apache/spark/pull/10436
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069092#comment-15069092 ] Apache Spark commented on SPARK-12479: -- User 'paulomagalhaes' has created a pull request for this issue: https://github.com/apache/spark/pull/10436
> sparkR collect on GroupedData throws R error "missing value where
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
> Issue Type: Bug
> Components: R, SparkR
> Affects Versions: 1.5.1
> Reporter: Paulo Magalhaes
>
> sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
> missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because the
> bitwise shift below returns NA:
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function.
> I believe the bitwShiftL function is working as it is supposed to. Therefore,
> this PR fixes it in the SparkR package:
> https://github.com/apache/spark/pull/10436
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
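The failure described above is a 32-bit overflow issue: Java-style hash codes rely on silent int wrap-around, while R's bitwShiftL returns NA once a shift leaves the representable 32-bit signed range, which is how bitwShiftL(-1073741824, 1) yields NA. The sketch below shows the intended wrap-around arithmetic in plain Python for illustration only; `java_string_hashcode` is a hypothetical name, not the SparkR function, and the actual fix lives in the PR linked above.

```python
def java_string_hashcode(s):
    """Java-style String.hashCode: h = 31*h + ch, wrapping at 32 bits.
    The explicit mask emulates Java's silent int overflow, which is the
    behavior R's bitwShiftL lacks (it returns NA instead of wrapping)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF   # keep only the low 32 bits
    if h >= 0x80000000:                        # reinterpret as signed 32-bit int
        h -= 0x100000000
    return h

# The shift that returns NA in R wraps cleanly under 32-bit semantics:
wrapped = (-1073741824 << 1) & 0xFFFFFFFF
wrapped = wrapped - 0x100000000 if wrapped >= 0x80000000 else wrapped
print(wrapped)  # prints -2147483648
```

Any R-side fix needs to reproduce this wrap-around explicitly, since R's 32-bit integer type has no overflow semantics to lean on; the Python above only illustrates the arithmetic being emulated.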
[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"
[ https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12479: Assignee: Apache Spark > sparkR collect on GroupedData throws R error "missing value where > TRUE/FALSE needed" > -- > > Key: SPARK-12479 > URL: https://issues.apache.org/jira/browse/SPARK-12479 > Project: Spark > Issue Type: Bug > Components: R, SparkR >Affects Versions: 1.5.1 >Reporter: Paulo Magalhaes >Assignee: Apache Spark > > sparkR collect on GroupedData throws "missing value where TRUE/FALSE needed" > Spark Version: 1.5.1 > R Version: 3.2.2 > I tracked down the root cause of this exception to a specific key for which > the hashCode could not be calculated. > The following code recreates the problem when run in sparkR: > hashCode <- getFromNamespace("hashCode","SparkR") > hashCode("bc53d3605e8a5b7de1e8e271c2317645") > Error in if (value > .Machine$integer.max) { : > missing value where TRUE/FALSE needed > I went one step further and realised that the problem happens because of the > bitwise shift below returning NA. > bitwShiftL(-1073741824,1) > where bitwShiftL is an R function. > I believe the bitwShiftL function is working as it is supposed to. Therefore, > this PR fixes it in the SparkR package: > https://github.com/apache/spark/pull/10436 > . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069091#comment-15069091 ] Balaji Ramamoorthy commented on SPARK-11045: [~dibbhatt] We are looking for a high-throughput Kafka consumer, and exactly-once is not a priority. I also could not figure out how to set the StorageLevel in KafkaDirect to MEMORY_ONLY. Since the low-level receiver-based consumer supports everything I am looking for, I am curious to know how much performance improvement it provides over KafkaDirect. Did you get a chance to do any benchmarking? > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya > > This JIRA is to track the progress of making the Receiver based Low Level > Kafka Consumer from spark-packages > (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be > contributed back to the Apache Spark Project. > This Kafka consumer has been around for more than a year and has matured over > time. I see many adoptions of this package, and I receive > positive feedback that this consumer gives better performance and fault > tolerance capabilities. > The primary intent of this JIRA is to give the community a better > alternative if they want to use the Receiver-based model. > If this consumer makes it to Spark Core, it will definitely see more adoption > and support from the community and help many who still prefer the Receiver-based > model of Kafka Consumer. > I understand the Direct Stream is the consumer which can give Exactly Once > semantics and uses the Kafka Low Level API, which is good. But Direct Stream > has concerns around recovering the checkpoint on driver code change. Application > developers need to manage their own offsets, which is complex. 
Even if someone > does manage their own offsets, it limits the parallelism Spark Streaming can > achieve. If someone wants more parallelism and wants > spark.streaming.concurrentJobs greater than 1, you can no longer rely on > storing offsets externally, as you have no control over which batch will run in > which sequence. > Furthermore, the Direct Stream has higher latency, as it fetches messages > from Kafka during the RDD action. Also, the number of RDD partitions is limited to > the number of topic partitions. So unless your Kafka topic has enough partitions, > you have limited parallelism during RDD processing. > Due to the above-mentioned concerns, many people who do not want Exactly Once > semantics still prefer the Receiver-based model. Unfortunately, when customers > fall back to the KafkaUtil.CreateStream approach, which uses the Kafka High Level > Consumer, there are other issues around the reliability of the Kafka High Level > API. The Kafka High Level API is buggy and has a serious issue around Consumer > Re-balance. Hence I do not think it is correct to advise people to use > KafkaUtil.CreateStream in production. > The better option presently is to use the Consumer from > spark-packages. It uses the Kafka Low Level Consumer API, stores offsets > in Zookeeper, and can recover from any failure. Below are a few highlights of > this consumer: > 1. It has an inbuilt PID Controller for dynamic rate limiting. > 2. In this consumer, rate limiting is done by controlling the size of blocks, > i.e. the number of bytes pulled from Kafka. Whereas in Spark, > rate limiting is done by controlling the number of messages. The issue with > throttling by number of messages is that if message sizes vary, block size will > also vary. Say your Kafka topic has messages of different sizes, from 10KB to > 500KB. Throttling by number of messages can never give a deterministic > size for your block, hence there is no guarantee that memory back-pressure can > really take effect. > 3. 
This consumer uses the Kafka low level API, which gives better performance > than the KafkaUtils.createStream based High Level API. > 4. This consumer can give an end-to-end no-data-loss channel if enabled with WAL. > By accepting this low level Kafka consumer from spark-packages into the Apache > Spark project, we will give the community better options for Kafka > connectivity, both for the receiver-less and receiver-based models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
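The size-versus-count throttling point in the quoted description can be illustrated with a toy Python sketch (function names and numbers are illustrative only, not taken from the spark-packages consumer):

```python
def block_bytes_by_count(sizes, max_messages):
    # count-based throttling: cap the number of messages per block;
    # the block's byte size then swings with the message sizes
    return sum(sizes[:max_messages])

def block_bytes_by_size(sizes, byte_budget):
    # size-based throttling: pull messages until a byte budget is
    # reached, giving a near-constant block size regardless of how
    # large or small individual messages are
    total, taken = 0, 0
    for s in sizes:
        if taken > 0 and total + s > byte_budget:
            break
        total += s
        taken += 1
    return total

# With messages ranging 10KB..500KB, a 100-message block can weigh
# anywhere from ~1MB to ~50MB, while a 5MB byte budget stays ~5MB.
small, large = [10_000] * 100, [500_000] * 100
print(block_bytes_by_count(small, 100), block_bytes_by_count(large, 100))
print(block_bytes_by_size(small + large, 5_000_000))
```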
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: Apache Spark > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Assignee: Apache Spark >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. 
> print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: (was: Apache Spark) > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. 
> print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069061#comment-15069061 ] Andrew Davidson commented on SPARK-12484: - Hi Xiao, thanks for looking into this. Andy > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069055#comment-15069055 ] Xiao Li commented on SPARK-12483: - Hi, Andy, See this example: {code} DataFrame df = hc.range(0, 100).unionAll(hc.range(0, 100)).select(col("id").as("value")); {code} Good luck, Xiao > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > The following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed: >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList<Features> data = new ArrayList<Features>(2); > data.add(f1); > data.add(f2); > //JavaRDD<Features> rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD<Features> rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the Spark log msgs): > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---+------------+ > | id|    labelStr| > +---+------------+ > |  1|       noise| > |  2|questionable| > +---+------------+ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > |  1| > |  2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10486) Spark intermittently fails to recover from a worker failure (in standalone mode)
[ https://issues.apache.org/jira/browse/SPARK-10486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069054#comment-15069054 ] Liang Chen commented on SPARK-10486: I meet the same problem > Spark intermittently fails to recover from a worker failure (in standalone > mode) > > > Key: SPARK-10486 > URL: https://issues.apache.org/jira/browse/SPARK-10486 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Cheuk Lam >Priority: Critical > > We have run into a problem where some Spark job is aborted after one worker > is killed in a 2-worker standalone cluster. The problem is intermittent, but > we can consistently reproduce it. The problem only appears to happen when we > kill a worker. It doesn't seem to happen when we kill an executor directly. > The program we use to reproduce the problem is some iterative program based > on GraphX, although the nature of the issue doesn't seem to be GraphX > related. This is how we reproduce the problem: > * Set up a standalone cluster of 2 workers; > * Run a Spark application of some iterative program (ours is some based on > GraphX); > * Kill a worker process (and thus the associated executor); > * Intermittently some job will be aborted. > The driver and the executor logs are available, as well as the application > history (event log file). But they are quite large and can't be attached > here. > ~ > After looking into the log files, we think the failure is caused by the > following two things combined: > * The BlockManagerMasterEndpoint in the driver has some stale block info > corresponding to the dead executor after the worker has been killed. The > driver does appear to handle the "RemoveExecutor" message and cleans up all > related block info. But subsequently, and intermittently, it receives some > Akka messages to re-register the dead BlockManager and re-add some of its > blocks. 
As a result, upon GetLocations requests from the remaining executor, > the driver responds with some stale block info, instructing the remaining > executor to fetch blocks from the dead executor. Please see the driver log > excerption below that shows the sequence of events described above. In the > log, there are two executors: 1.2.3.4 was the one which got shut down, while > 5.6.7.8 is the remaining executor. The driver also ran on 5.6.7.8. > * When the remaining executor's BlockManager issues a doGetRemote() call to > fetch the block of data, it fails because the targeted BlockManager which > resided in the dead executor is gone. This failure results in an exception > forwarded to the caller, bypassing the mechanism in the doGetRemote() > function to trigger a re-computation of the block. I don't know whether that > is intentional or not. > Driver log excerption that shows that the driver received messages to > re-register the dead executor after handling the RemoveExecutor message: > 11690 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message > (172.236378 ms) > AkkaMessage(RegisterExecutor(0,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/Executor#670388190]),1.2.3.4:36140,8,Map(stdout > -> > http://1.2.3.4:8081/logPage/?appId=app-20150902203512-&executorId=0&logType=stdout, > stderr -> > http://1.2.3.4:8081/logPage/?appId=app-20150902203512-&executorId=0&logType=stderr)),true) > from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$f] > 11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message > AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, > 52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true) > from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g] > 
11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: > AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, > 52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true) > 11718 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] INFO > BlockManagerMasterEndpoint: Registering block manager 1.2.3.4:52615 with 6.2 > GB RAM, BlockManagerId(0, 1.2.3.4, 52615) > 11719 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG > AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message > (1.498313 ms) AkkaMessage(Regi
[jira] [Commented] (SPARK-12478) Dataset fields of product types can't be null
[ https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069044#comment-15069044 ] Cheng Lian commented on SPARK-12478: I'm leaving this ticket open since we also need to backport this to branch-1.6 after the release. > Dataset fields of product types can't be null > - > > Key: SPARK-12478 > URL: https://issues.apache.org/jira/browse/SPARK-12478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > Labels: backport-needed > > Spark shell snippet for reproduction: > {code} > import sqlContext.implicits._ > case class Inner(f: Int) > case class Outer(i: Inner) > Seq(Outer(null)).toDS().toDF().show() > Seq(Outer(null)).toDS().show() > {code} > Expected output should be: > {noformat} > ++ > | i| > ++ > |null| > ++ > ++ > | i| > ++ > |null| > ++ > {noformat} > Actual output: > {noformat} > +--+ > | i| > +--+ > |[null]| > +--+ > java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: > Null value appeared in non-nullable field Inner.f of type scala.Int. If the > schema is inferred from a Scala tuple/case class, or a Java bean, please try > to use scala.Option[_] or other nullable types (e.g. java.lang.Integer > instead of int/scala.Int). 
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, > StructType(StructField(f,IntegerType,false))])) null else newinstance(class > $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class > $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3)) > +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null > else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) >:- isnull(input[0, StructType(StructField(f,IntegerType,false))]) >: +- input[0, StructType(StructField(f,IntegerType,false))] >:- null >+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) > +- assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int) > +- input[0, StructType(StructField(f,IntegerType,false))].f > +- input[0, StructType(StructField(f,IntegerType,false))] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:704) > at org.apache.spark.sql.Dataset.take(Dataset.scala:725) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:240) > at org.apache.spark.sql.Dataset.show(Dataset.scala:230) > at org.apache.spark.sql.Dataset.show(Dataset.scala:193) > at org.apache.spark.sql.Dataset.show(Dataset.scala:201) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:46) > at $iwC$$iwC$$iwC$$iwC.(:48) > at $iwC$$iwC$$iwC.(:50) > at $iwC$$iwC.(:52) > at $iwC.(:54) > at (:56) > at .(:60) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > a
[jira] [Updated] (SPARK-12478) Dataset fields of product types can't be null
[ https://issues.apache.org/jira/browse/SPARK-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-12478: --- Labels: backport-needed (was: ) > Dataset fields of product types can't be null > - > > Key: SPARK-12478 > URL: https://issues.apache.org/jira/browse/SPARK-12478 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > Labels: backport-needed > > Spark shell snippet for reproduction: > {code} > import sqlContext.implicits._ > case class Inner(f: Int) > case class Outer(i: Inner) > Seq(Outer(null)).toDS().toDF().show() > Seq(Outer(null)).toDS().show() > {code} > Expected output should be: > {noformat} > ++ > | i| > ++ > |null| > ++ > ++ > | i| > ++ > |null| > ++ > {noformat} > Actual output: > {noformat} > +--+ > | i| > +--+ > |[null]| > +--+ > java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: > Null value appeared in non-nullable field Inner.f of type scala.Int. If the > schema is inferred from a Scala tuple/case class, or a Java bean, please try > to use scala.Option[_] or other nullable types (e.g. java.lang.Integer > instead of int/scala.Int). 
> newinstance(class $iwC$$iwC$Outer,if (isnull(input[0, > StructType(StructField(f,IntegerType,false))])) null else newinstance(class > $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)),false,ObjectType(class > $iwC$$iwC$Outer),Some($iwC$$iwC@6ab35ce3)) > +- if (isnull(input[0, StructType(StructField(f,IntegerType,false))])) null > else newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) >:- isnull(input[0, StructType(StructField(f,IntegerType,false))]) >: +- input[0, StructType(StructField(f,IntegerType,false))] >:- null >+- newinstance(class $iwC$$iwC$Inner,assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int),false,ObjectType(class > $iwC$$iwC$Inner),Some($iwC$$iwC@6616b9e0)) > +- assertnotnull(input[0, > StructType(StructField(f,IntegerType,false))].f,Inner,f,scala.Int) > +- input[0, StructType(StructField(f,IntegerType,false))].f > +- input[0, StructType(StructField(f,IntegerType,false))] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:224) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > org.apache.spark.sql.Dataset$$anonfun$collect$2.apply(Dataset.scala:704) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at 
org.apache.spark.sql.Dataset.collect(Dataset.scala:704) > at org.apache.spark.sql.Dataset.take(Dataset.scala:725) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:240) > at org.apache.spark.sql.Dataset.show(Dataset.scala:230) > at org.apache.spark.sql.Dataset.show(Dataset.scala:193) > at org.apache.spark.sql.Dataset.show(Dataset.scala:201) > at > $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:42) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:46) > at $iwC$$iwC$$iwC$$iwC.(:48) > at $iwC$$iwC$$iwC.(:50) > at $iwC$$iwC.(:52) > at $iwC.(:54) > at (:56) > at .(:60) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045) > at > org.apache.spark.repl.Sp
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069027#comment-15069027 ] Andrew Davidson commented on SPARK-12483: - Hi Xiao, thanks for looking at the issue. Is there a way to change a column name? If you do a select() using a data frame, the column name is really strange; see the attachment for https://issues.apache.org/jira/browse/SPARK-12484 // get column from data frame call df.withColumnName Column newCol = udfDF.col("_c0"); Renaming data frame columns is very common in R. Kind regards, Andy > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069025#comment-15069025 ] Xiao Li commented on SPARK-12484: - Please check the email answer by [~zjffdu] Thanks! > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069020#comment-15069020 ] Xiao Li commented on SPARK-12483: - That means, it is not a bug. Thanks! > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069019#comment-15069019 ] Xiao Li commented on SPARK-12483: - {code}as{code} is used to return a new DataFrame with an alias set. For example, {code} val x = testData2.as("x") val y = testData2.as("y") val join = x.join(y, $"x.a" === $"y.a", "inner").queryExecution.optimizedPlan {code} > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12220) Make Utils.fetchFile support files that contain special characters
[ https://issues.apache.org/jira/browse/SPARK-12220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12220: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Make Utils.fetchFile support files that contain special characters > -- > > Key: SPARK-12220 > URL: https://issues.apache.org/jira/browse/SPARK-12220 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Now if a file name contains some illegal characters, such as " ", > Utils.fetchFile will fail because it doesn't handle this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
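The hazard described above can be sketched with the JDK alone (a minimal illustration, not Spark's actual {{Utils.fetchFile}} code; the {{/tmp/my file.txt}} path is hypothetical): a raw space is illegal in a URI string, while the multi-argument {{java.net.URI}} constructor percent-encodes such characters.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SpecialCharPath {
    // Build a file: URI for a path that may contain characters (such as " ")
    // that are illegal in a raw URI string.
    public static URI toFileUri(String path) throws URISyntaxException {
        // The multi-argument constructor percent-encodes illegal characters.
        return new URI("file", null, path, null);
    }

    public static void main(String[] args) throws Exception {
        String path = "/tmp/my file.txt"; // hypothetical file name with a space
        try {
            new URI("file:" + path); // raw space -> URISyntaxException
        } catch (URISyntaxException e) {
            System.out.println("raw URI rejected: " + e.getReason());
        }
        System.out.println(toFileUri(path).toASCIIString()); // file:/tmp/my%20file.txt
    }
}
```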
[jira] [Updated] (SPARK-11749) Duplicate creating the RDD in file stream when recovering from checkpoint data
[ https://issues.apache.org/jira/browse/SPARK-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-11749: - Affects Version/s: (was: 1.5.0) 1.6.0 > Duplicate creating the RDD in file stream when recovering from checkpoint data > -- > > Key: SPARK-11749 > URL: https://issues.apache.org/jira/browse/SPARK-11749 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0 >Reporter: Jack Hu >Assignee: Jack Hu > Fix For: 1.6.0 > > > I have a case to monitor a HDFS folder, then enrich the incoming data from > the HDFS folder via different table (about 15 reference tables) and send to > different hive table after some operations. > The code is as this: > {code} > val txt = > ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates) > val refTable1 = > ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...) > txt.join(refTable1).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > val refTable2 = > ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...) > txt.join(refTable2).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > /// more refTables in following code > {code} > > The {{batchInterval}} of this application is set to *30 seconds*, the > checkpoint interval is set to *10 minutes*, every batch in {{txt}} has *60 > files* > After recovered from checkpoint data, I can see lots of log to create the RDD > in file stream: rdd in each batch of file stream was been recreated *15 > times*, and it takes about *5 minutes* to create so much file RDD. During > this period, *10K+ broadcast* had been created and almost used all the block > manager space. 
> After some investigation, we found that {{DStream.restoreCheckpointData}} > would be invoked at each output ({{DStream.foreachRDD}} in this case), with no > flag to indicate that this {{DStream}} had already been restored, so the RDD in the file > stream was recreated. > We suggest adding a flag to control the restore process to avoid the duplicated > work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12376) Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method
[ https://issues.apache.org/jira/browse/SPARK-12376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12376: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method > > > Key: SPARK-12376 > URL: https://issues.apache.org/jira/browse/SPARK-12376 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 > Environment: Oracle Java 64-bit (build 1.8.0_66-b17) >Reporter: Evan Chen >Assignee: Evan Chen >Priority: Minor > Fix For: 1.6.0 > > > org.apache.spark.streaming.Java8APISuite.java is failing due to trying to > sort immutable list in assertOrderInvariantEquals method. > Here are the errors: > Tests run: 27, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 5.948 sec > <<< FAILURE! - in org.apache.spark.streaming.Java8APISuite > testMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.217 sec > <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFlatMap(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.203 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testFilter(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.209 sec > <<< ERROR! 
> java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > testTransform(org.apache.spark.streaming.Java8APISuite) Time elapsed: 0.215 > sec <<< ERROR! > java.lang.UnsupportedOperationException: null > at java.util.AbstractList.set(AbstractList.java:132) > at java.util.AbstractList$ListItr.set(AbstractList.java:426) > at java.util.List.sort(List.java:482) > at java.util.Collections.sort(Collections.java:141) > at > org.apache.spark.streaming.Java8APISuite.lambda$assertOrderInvariantEquals$1(Java8APISuite.java:444) > Results : > Tests in error: > > Java8APISuite.testFilter:81->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testFlatMap:360->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testMap:63->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation > > Java8APISuite.testTransform:168->assertOrderInvariantEquals:444->lambda$assertOrderInvariantEquals$1:444 > » UnsupportedOperation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
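The failure mode above is reproducible with the JDK alone (a minimal sketch, not the suite's actual helper): {{Collections.sort}} writes the sorted result back through {{ListIterator.set}}, which a read-only {{AbstractList}} does not support, so the fix is to copy into a mutable list before sorting.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ImmutableSortDemo {
    // A minimal read-only List: AbstractList's default set() throws
    // UnsupportedOperationException, like the immutable lists the suite hit.
    public static List<Integer> readOnly(int... values) {
        return new AbstractList<Integer>() {
            public Integer get(int i) { return values[i]; }
            public int size() { return values.length; }
        };
    }

    public static void main(String[] args) {
        List<Integer> list = readOnly(3, 1, 2);
        try {
            Collections.sort(list); // sorts via ListIterator.set -> throws
        } catch (UnsupportedOperationException e) {
            System.out.println("sort failed: " + e.getClass().getSimpleName());
        }
        // The fix: copy into a mutable list before sorting.
        List<Integer> copy = new ArrayList<>(list);
        Collections.sort(copy);
        System.out.println(copy); // [1, 2, 3]
    }
}
```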
[jira] [Updated] (SPARK-11749) Duplicate creating the RDD in file stream when recovering from checkpoint data
[ https://issues.apache.org/jira/browse/SPARK-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-11749: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Duplicate creating the RDD in file stream when recovering from checkpoint data > -- > > Key: SPARK-11749 > URL: https://issues.apache.org/jira/browse/SPARK-11749 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0 >Reporter: Jack Hu >Assignee: Jack Hu > Fix For: 1.6.0 > > > I have a case to monitor a HDFS folder, then enrich the incoming data from > the HDFS folder via different table (about 15 reference tables) and send to > different hive table after some operations. > The code is as this: > {code} > val txt = > ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates) > val refTable1 = > ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...) > txt.join(refTable1).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > val refTable2 = > ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...) > txt.join(refTable2).map(..).reduceByKey(...).foreachRDD( > rdd => { > // insert into hive table > } > ) > /// more refTables in following code > {code} > > The {{batchInterval}} of this application is set to *30 seconds*, the > checkpoint interval is set to *10 minutes*, every batch in {{txt}} has *60 > files* > After recovered from checkpoint data, I can see lots of log to create the RDD > in file stream: rdd in each batch of file stream was been recreated *15 > times*, and it takes about *5 minutes* to create so much file RDD. During > this period, *10K+ broadcast* had been created and almost used all the block > manager space. 
> After some investigation, we found that {{DStream.restoreCheckpointData}} > would be invoked at each output ({{DStream.foreachRDD}} in this case), with no > flag to indicate that this {{DStream}} had already been restored, so the RDD in the file > stream was recreated. > We suggest adding a flag to control the restore process to avoid the duplicated > work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12386: - Target Version/s: 1.6.0 (was: 1.6.1, 2.0.0) > Setting "spark.executor.port" leads to NPE in SparkEnv > -- > > Key: SPARK-12386 > URL: https://issues.apache.org/jira/browse/SPARK-12386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 1.6.0 > > > From the list: > {quote} > when we set spark.executor.port in 1.6, we get thrown a NPE in > SparkEnv$.create(SparkEnv.scala:259). > {quote} > Fix is simple; probably should make it to 1.6.0 since it will affect anyone > using that config options, but I'll leave that to the release manager's > discretion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12386) Setting "spark.executor.port" leads to NPE in SparkEnv
[ https://issues.apache.org/jira/browse/SPARK-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12386: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > Setting "spark.executor.port" leads to NPE in SparkEnv > -- > > Key: SPARK-12386 > URL: https://issues.apache.org/jira/browse/SPARK-12386 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 1.6.0 > > > From the list: > {quote} > when we set spark.executor.port in 1.6, we get thrown a NPE in > SparkEnv$.create(SparkEnv.scala:259). > {quote} > Fix is simple; probably should make it to 1.6.0 since it will affect anyone > using that config options, but I'll leave that to the release manager's > discretion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12410) "." and "|" used for String.split directly
[ https://issues.apache.org/jira/browse/SPARK-12410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-12410: - Fix Version/s: (was: 1.6.1) (was: 2.0.0) 1.6.0 > "." and "|" used for String.split directly > -- > > Key: SPARK-12410 > URL: https://issues.apache.org/jira/browse/SPARK-12410 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.4.1, 1.5.2, 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 1.4.2, 1.5.3, 1.6.0 > > > String.split accepts a regular expression, so we should escape "." and "|". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
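The pitfall fixed above is easy to demonstrate with the JDK alone (a minimal sketch of the hazard, not the patched Spark code): {{String.split}} treats its argument as a regular expression, so an unescaped "." matches every character and every chunk comes back empty.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitEscapeDemo {
    public static void main(String[] args) {
        // "." as a regex matches every character, so each split chunk is empty
        // and trailing empties are dropped -> an empty array.
        System.out.println(Arrays.toString("a.b.c".split(".")));   // []
        // Escaping the metacharacter splits on a literal dot.
        System.out.println(Arrays.toString("a.b.c".split("\\."))); // [a, b, c]
        // Pattern.quote is the general-purpose way to split on a literal.
        System.out.println(Arrays.toString("a|b".split(Pattern.quote("|")))); // [a, b]
    }
}
```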
[jira] [Updated] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-12429: -- Assignee: Shixiong Zhu (was: Apache Spark) > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Accumulators and Broadcasts with Spark Streaming cannot work perfectly when > restarting on driver failures. We need to add some example to guide the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12429) Update documentation to show how to use accumulators and broadcasts with Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-12429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-12429. --- Resolution: Fixed Fix Version/s: 1.6.0 > Update documentation to show how to use accumulators and broadcasts with > Spark Streaming > > > Key: SPARK-12429 > URL: https://issues.apache.org/jira/browse/SPARK-12429 > Project: Spark > Issue Type: Documentation > Components: Documentation, Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > > Accumulators and Broadcasts with Spark Streaming cannot work perfectly when > restarting on driver failures. We need to add some example to guide the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12061) Persist for Map/filter with Lambda Functions don't always read from Cache
[ https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058419#comment-15058419 ] Xiao Li edited comment on SPARK-12061 at 12/23/15 12:22 AM: The logical plan's cleanArgs do not match when calling sameResult. was (Author: smilegator): Start working on it. Thanks! > Persist for Map/filter with Lambda Functions don't always read from Cache > - > > Key: SPARK-12061 > URL: https://issues.apache.org/jira/browse/SPARK-12061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li > > So far, the existing caching mechanisms do not work on dataset operations > when using map/filter with lambda functions. For example, > {code} > test("persist and then map/filter with lambda functions") { > val f = (i: Int) => i + 1 > val ds = Seq(1, 2, 3).toDS() > val mapped = ds.map(f) > mapped.cache() > val mapped2 = ds.map(f) > assertCached(mapped2) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
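For context, one JVM-level reason function-valued plan arguments are hard to compare (a simplified illustration, not Spark's actual cleanArgs/sameResult logic): two behaviorally identical function objects still only compare equal by reference.

```java
import java.util.function.IntUnaryOperator;

public class LambdaEqualityDemo {
    public static void main(String[] args) {
        // Two lambdas with identical bodies are distinct objects; Object.equals
        // is reference equality, so behavioral equivalence is invisible to it.
        IntUnaryOperator f1 = i -> i + 1;
        IntUnaryOperator f2 = i -> i + 1;
        System.out.println(f1.applyAsInt(1) == f2.applyAsInt(1)); // true
        System.out.println(f1.equals(f2));                        // false
    }
}
```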
[jira] [Resolved] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-12487. --- Resolution: Fixed Fix Version/s: 1.6.0 > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068908#comment-15068908 ] Timothy Hunter commented on SPARK-12247: It seems to me that the calculation of false positives is more relevant for the movie ratings, and that the RMSE right above in the example is already a good example. What do you think? > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12102: - Target Version/s: (was: 1.6.0) > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12102. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10156 [https://github.com/apache/spark/pull/10156] > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12102) Cast a non-nullable struct field to a nullable field during analysis
[ https://issues.apache.org/jira/browse/SPARK-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12102: - Assignee: Dilip Biswal > Cast a non-nullable struct field to a nullable field during analysis > > > Key: SPARK-12102 > URL: https://issues.apache.org/jira/browse/SPARK-12102 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Dilip Biswal > Fix For: 2.0.0 > > > If you try {{sqlContext.sql("select case when 1>0 then struct(1, 2, 3, > cast(hash(4) as int)) else struct(1, 2, 3, 4) end").printSchema}}, you will > see {{org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (1 > > 0) THEN > struct(1,2,3,cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(4) > as int)) ELSE struct(1,2,3,4)' due to data type mismatch: THEN and ELSE > expressions should all be same type or coercible to a common type; line 1 pos > 85}}. > The problem is the nullability difference between {{4}} (non-nullable) and > {{hash(4)}} (nullable). > Seems it makes sense to cast the nullability in the analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12441) Fixing missingInput in all Logical/Physical operators
[ https://issues.apache.org/jira/browse/SPARK-12441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12441: Summary: Fixing missingInput in all Logical/Physical operators (was: Fixing missingInput in Generate/MapPartitions/AppendColumns/MapGroups/CoGroup) > Fixing missingInput in all Logical/Physical operators > - > > Key: SPARK-12441 > URL: https://issues.apache.org/jira/browse/SPARK-12441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Xiao Li > > The value of missingInput in > Generate/MapPartitions/AppendColumns/MapGroups/CoGroup is incorrect. > {code} > val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters") > val df2 = > df.explode('letters) { > case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq > } > df2.explain(true) > {code} > {code} > == Parsed Logical Plan == > 'Generate UserDefinedGenerator('letters), true, false, None > +- Project [_1#0 AS number#2,_2#1 AS letters#3] >+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] > == Analyzed Logical Plan == > number: int, letters: string, _1: string > Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] > +- Project [_1#0 AS number#2,_2#1 AS letters#3] >+- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] > == Optimized Logical Plan == > Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] > +- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] > == Physical Plan == > !Generate UserDefinedGenerator(letters#3), true, false, > [number#2,letters#3,_1#8] > +- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12491) UDAF result differs in SQL if alias is used
Tristan created SPARK-12491: --- Summary: UDAF result differs in SQL if alias is used Key: SPARK-12491 URL: https://issues.apache.org/jira/browse/SPARK-12491 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Reporter: Tristan Using the GeometricMean UDAF example (https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html), I found the following discrepancy in results:

scala> sqlContext.sql("select group_id, gm(id) from simple group by group_id").show()
+--------+---+
|group_id|_c1|
+--------+---+
|       0|0.0|
|       1|0.0|
|       2|0.0|
+--------+---+

scala> sqlContext.sql("select group_id, gm(id) as GeometricMean from simple group by group_id").show()
+--------+-----------------+
|group_id|    GeometricMean|
+--------+-----------------+
|       0|8.981385496571725|
|       1|7.301716979342118|
|       2|7.706253151292568|
+--------+-----------------+

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
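For reference, the GeometricMean UDAF in the linked blog post computes the n-th root of the product of its inputs. A plain-Scala equivalent (written in the numerically safer exp-of-mean-of-logs form, valid for strictly positive inputs) is useful for checking which of the two results above is correct:

```scala
// Reference implementation: geometric mean as exp(mean(ln x)),
// which equals the n-th root of the product for positive inputs.
def geometricMean(xs: Seq[Double]): Double =
  math.exp(xs.map(math.log).sum / xs.size)

val gm = geometricMean(Seq(2.0, 8.0)) // square root of 16
```

The non-zero aliased results are consistent with this definition, which points at the unaliased query (returning 0.0) as the buggy path.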
[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068827#comment-15068827 ] Ilya Ganelin commented on SPARK-12488: -- Further investigation identifies the issue as stemming from the docTermVector containing zero-vectors (as in no words from the vocabulary present in the document). > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
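Per the comment above, the invalid IDs stem from documents whose term-count vector is all zeros. One defensive workaround (a sketch only, not the ticket's fix; the `(docId, counts)` tuples mimic the shape of an MLlib corpus but use plain arrays so the example stands alone) is to drop empty documents before calling `lda.run`:

```scala
// Corpus as (document id, term-count vector); document 1 contains no
// vocabulary terms at all, the condition that triggered the bad term IDs.
val corpus = Seq(
  (0L, Array(1.0, 0.0, 2.0)),
  (1L, Array(0.0, 0.0, 0.0)), // zero-vector document
  (2L, Array(0.0, 3.0, 0.0))
)

// Keep only documents with at least one non-zero term count.
val nonEmpty = corpus.filter { case (_, counts) => counts.exists(_ != 0.0) }
```

Applying the same filter to the real `docTermVector` RDD before training would avoid feeding zero-vectors to the model.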
[jira] [Assigned] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
[ https://issues.apache.org/jira/browse/SPARK-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12490: Assignee: Apache Spark (was: Josh Rosen) > Don't use Javascript for web UI's paginated table navigation controls > - > > Key: SPARK-12490 > URL: https://issues.apache.org/jira/browse/SPARK-12490 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Josh Rosen >Assignee: Apache Spark > > The web UI's paginated table uses Javascript to implement certain navigation > controls, such as table sorting and the "go to page" form. This is > unnecessary and should be simplified to use plain HTML form controls and > links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
[ https://issues.apache.org/jira/browse/SPARK-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068802#comment-15068802 ] Apache Spark commented on SPARK-12490: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10441 > Don't use Javascript for web UI's paginated table navigation controls > - > > Key: SPARK-12490 > URL: https://issues.apache.org/jira/browse/SPARK-12490 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Josh Rosen >Assignee: Josh Rosen > > The web UI's paginated table uses Javascript to implement certain navigation > controls, such as table sorting and the "go to page" form. This is > unnecessary and should be simplified to use plain HTML form controls and > links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12490) Don't use Javascript for web UI's paginated table navigation controls
Josh Rosen created SPARK-12490: -- Summary: Don't use Javascript for web UI's paginated table navigation controls Key: SPARK-12490 URL: https://issues.apache.org/jira/browse/SPARK-12490 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Josh Rosen Assignee: Josh Rosen The web UI's paginated table uses Javascript to implement certain navigation controls, such as table sorting and the "go to page" form. This is unnecessary and should be simplified to use plain HTML form controls and links. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12489) Fix minor issues found by Findbugs
[ https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12489: Assignee: Apache Spark > Fix minor issues found by Findbugs > -- > > Key: SPARK-12489 > URL: https://issues.apache.org/jira/browse/SPARK-12489 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > > Just used FindBugs to scan the codes and fixed some real issues: > 1. Close `java.sql.Statement` > 2. Fix incorrect `asInstanceOf`. > 3. Remove unnecessary `synchronized` and `ReentrantLock`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12489) Fix minor issues found by Findbugs
[ https://issues.apache.org/jira/browse/SPARK-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12489: Assignee: (was: Apache Spark) > Fix minor issues found by Findbugs > -- > > Key: SPARK-12489 > URL: https://issues.apache.org/jira/browse/SPARK-12489 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Reporter: Shixiong Zhu >Priority: Minor > > Just used FindBugs to scan the codes and fixed some real issues: > 1. Close `java.sql.Statement` > 2. Fix incorrect `asInstanceOf`. > 3. Remove unnecessary `synchronized` and `ReentrantLock`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12489) Fix minor issues found by Findbugs
Shixiong Zhu created SPARK-12489: Summary: Fix minor issues found by Findbugs Key: SPARK-12489 URL: https://issues.apache.org/jira/browse/SPARK-12489 Project: Spark Issue Type: Bug Components: MLlib, Spark Core, SQL Reporter: Shixiong Zhu Priority: Minor Just used FindBugs to scan the code and fixed some real issues: 1. Close `java.sql.Statement`. 2. Fix incorrect `asInstanceOf`. 3. Remove unnecessary `synchronized` and `ReentrantLock`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
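Item 1 (closing `java.sql.Statement`) is an instance of the loan pattern. A minimal Scala sketch with a stand-in resource so it runs without a database (the `withCloseable` helper name is illustrative, not Spark's code):

```scala
// Loan pattern: the resource is closed even if the body throws.
def withCloseable[C <: AutoCloseable, T](resource: C)(body: C => T): T =
  try body(resource) finally resource.close()

// Stand-in resource in place of a real java.sql.Statement.
var closed = false
val stmt = new AutoCloseable { def close(): Unit = closed = true }
val result = withCloseable(stmt) { _ => 42 }
```

FindBugs flags statements that can leak on an exception path; routing every use through a helper like this makes the close unconditional.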
[jira] [Comment Edited] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068693#comment-15068693 ] Pete Robbins edited comment on SPARK-12470 at 12/22/15 9:47 PM: I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail, and I'm struggling to see why. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600, and after my change they are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. EDIT: Please ignore. Merged with the latest head, including the changes for SPARK-12388; all tests now pass. was (Author: robbinspg): I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail, and I'm struggling to see why. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600, and after my change they are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. 
> Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
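The unit mismatch is easy to see in isolation: `sizeReduction` counts 8-byte words, so subtracting it directly from a byte count under-reduces by a factor of 8. A standalone sketch (the concrete values are chosen for illustration, not taken from Spark):

```scala
// One word = 8 bytes in UnsafeRow's layout.
val bitset1Words = 1
val bitset2Words = 1
val outputBitsetWords = 1 // the merged bitset fits in one word
val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords // in WORDS

val sizeInBytes = 64
val buggy = sizeInBytes - sizeReduction     // mixes units: subtracts words from bytes
val fixed = sizeInBytes - sizeReduction * 8 // convert words to bytes first
```

The buggy form overstates the joined row's size by 7 bytes per saved word, which matches the inflated pre-shuffle partition sizes reported in the comment above.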
[jira] [Comment Edited] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068761#comment-15068761 ] Ilya Ganelin edited comment on SPARK-12488 at 12/22/15 9:32 PM: [~josephkb] Would love your feedback here. Thanks! was (Author: ilganeli): @jkbradley Would love your feedback here. Thanks! > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068761#comment-15068761 ] Ilya Ganelin commented on SPARK-12488: -- @jkbradley Would love your feedback here. Thanks! > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12471. - Resolution: Fixed Assignee: Nong Li (was: Apache Spark) Fix Version/s: 2.0.0 > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li >Assignee: Nong Li > Fix For: 2.0.0 > > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-12488: - Description: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: The below example generates 10 topics on a data set with a vocabulary of 685. {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted.reverse res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
{code} {code} scala> ldaModel.describeTopics()(0)._1.sorted res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4... {code} was: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: The below example generated 10 topics on a data set with a vocabulary of 685. 
{code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted.reverse res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4... 
{code} > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2
[jira] [Updated] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-12488: - Summary: LDA describeTopics() Generates Invalid Term IDs (was: LDA Describe Topics Generates Invalid Term IDs) > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generated 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12488) LDA Describe Topics Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-12488: - Description: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: The below example generated 10 topics on a data set with a vocabulary of 685. {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} {code} scala> ldaModel.describeTopics()(0)._1.sorted.reverse res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
{code} {code} scala> ldaModel.describeTopics()(0)._1.sorted res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4... {code} was: When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} > LDA Describe Topics Generates Invalid Term IDs > -- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generated 10 topics on a data set with a vocabulary of 685. 
> {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -
[jira] [Commented] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068748#comment-15068748 ] Apache Spark commented on SPARK-12487: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/10439 > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12488) LDA Describe Topics Generates Invalid Term IDs
Ilya Ganelin created SPARK-12488: Summary: LDA Describe Topics Generates Invalid Term IDs Key: SPARK-12488 URL: https://issues.apache.org/jira/browse/SPARK-12488 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.5.2 Reporter: Ilya Ganelin When running the LDA model, and using the describeTopics function, invalid values appear in the termID list that is returned: {code} // Set LDA parameters val numTopics = 10 val lda = new LDA().setK(numTopics).setMaxIterations(10) val ldaModel = lda.run(docTermVector) val distModel = ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
[ https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068731#comment-15068731 ] Sean Owen commented on SPARK-12485: --- I must say I don't think that's worth it. Both terms seem equally sound, but dynamic allocation has been widely used to describe the feature externally. Renaming it doesn't seem to add much of anything in comparison. > Rename "dynamic allocation" to "elastic scaling" > > > Key: SPARK-12485 > URL: https://issues.apache.org/jira/browse/SPARK-12485 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > Fewer syllables, sounds more natural. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12487: Assignee: Apache Spark (was: Shixiong Zhu) > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12487) Add docs for Kafka message handler
[ https://issues.apache.org/jira/browse/SPARK-12487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12487: Assignee: Shixiong Zhu (was: Apache Spark) > Add docs for Kafka message handler > -- > > Key: SPARK-12487 > URL: https://issues.apache.org/jira/browse/SPARK-12487 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12487) Add docs for Kafka message handler
Shixiong Zhu created SPARK-12487: Summary: Add docs for Kafka message handler Key: SPARK-12487 URL: https://issues.apache.org/jira/browse/SPARK-12487 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12486) Executors are not always terminated successfully by the worker.
Nong Li created SPARK-12486: --- Summary: Executors are not always terminated successfully by the worker. Key: SPARK-12486 URL: https://issues.apache.org/jira/browse/SPARK-12486 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Nong Li There are cases where the worker does not kill the executor successfully. One way this can happen is if the executor is in a bad state, fails to heartbeat, and the master tells the worker to kill the executor, but the executor is in such a bad state that the kill request is ignored. This seems to happen when the executor is in heavy GC. The cause is that the Process.destroy() API is not forceful enough. In Java 8, a new API, destroyForcibly(), was added. We should use that where available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
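The graceful-then-forceful escalation the report asks for can be sketched outside Spark with Python's subprocess analogues: terminate() plays the role of Process.destroy() and kill() the role of Java 8's destroyForcibly(). The helper name and grace period below are illustrative assumptions, not Spark's actual worker code:

```python
import subprocess

def stop_process(proc, grace_seconds=5.0):
    """Ask a process to exit; escalate to a hard kill if it does not.

    Returns True if the process exited within the grace period.
    """
    proc.terminate()  # polite request (SIGTERM), analogous to Process.destroy()
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Forceful (SIGKILL), analogous to Java 8's destroyForcibly():
        # delivered even if the target is wedged, e.g. stuck in heavy GC.
        proc.kill()
        proc.wait()
        return False
```

A worker using this pattern instead of a bare destroy() would still reap an executor that ignores the polite request.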
[jira] [Assigned] (SPARK-12414) Remove closure serializer
[ https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-12414: - Assignee: Andrew Or > Remove closure serializer > - > > Key: SPARK-12414 > URL: https://issues.apache.org/jira/browse/SPARK-12414 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > There is a config `spark.closure.serializer` that accepts exactly one value: > the java serializer. This is because there are currently bugs in the Kryo > serializer that make it not a viable candidate. This was uncovered by an > unsuccessful attempt to make it work: SPARK-7708. > My high level point is that the Java serializer has worked well for at least > 6 Spark versions now, and it is an incredibly complicated task to get other > serializers (not just Kryo) to work with Spark's closures. IMO the effort is > not worth it and we should just remove this documentation and all the code > associated with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
Andrew Or created SPARK-12485: - Summary: Rename "dynamic allocation" to "elastic scaling" Key: SPARK-12485 URL: https://issues.apache.org/jira/browse/SPARK-12485 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Fewer syllables, sounds more natural. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"
[ https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12485: -- Target Version/s: 2.0.0 > Rename "dynamic allocation" to "elastic scaling" > > > Key: SPARK-12485 > URL: https://issues.apache.org/jira/browse/SPARK-12485 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Andrew Or >Assignee: Andrew Or > > Fewer syllables, sounds more natural. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068698#comment-15068698 ] Andrew Davidson commented on SPARK-12484: - related issue: https://issues.apache.org/jira/browse/SPARK-12483 > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent
by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068696#comment-15068696 ] Andrew Davidson commented on SPARK-12484: - What I am really trying to do is rewrite the following Python code in Java. Ideally I would implement this code as an MLlib transformation; however, that does not seem possible at this point in time using the Java API. Kind regards, Andy
def convertMultinomialLabelToBinary(dataFrame):
    newColName = "binomialLabel"
    binomial = udf(lambda labelStr: labelStr if (labelStr == "noise") else "signal", StringType())
    ret = dataFrame.withColumn(newColName, binomial(dataFrame["label"]))
    return ret
> DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068693#comment-15068693 ] Pete Robbins commented on SPARK-12470: -- I'm fairly sure the code in my PR is correct, but it is causing an ExchangeCoordinatorSuite test to fail, and I'm struggling to see why. The failure is: determining the number of reducers: aggregate operator *** FAILED *** 3 did not equal 2 (ExchangeCoordinatorSuite.scala:316) Putting some debug into the test, I see that before my change the pre-shuffle partition sizes are 600, 600, 600, 600, 600 and after my change are 800, 800, 800, 800, 720, but I have no idea why. I'd really appreciate anyone with knowledge of this area a) checking my PR and b) helping explain the failing test. > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
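The unit mismatch described in the issue — a reduction counted in 8-byte words subtracted from a size in bytes — comes down to a missing conversion. This is an illustrative Python sketch of the corrected arithmetic, not the actual generated Java; the function and parameter names are assumptions mirroring the quoted Scala:

```python
WORD_SIZE_BYTES = 8  # an UnsafeRow word is 8 bytes

def joined_size_in_bytes(size_in_bytes, bitset1_words, bitset2_words, output_bitset_words):
    """Size of two concatenated rows, with the bitset saving converted to bytes."""
    size_reduction_words = bitset1_words + bitset2_words - output_bitset_words
    # The reported bug subtracts size_reduction_words directly from a byte
    # count, under-counting the saving by a factor of 8.
    return size_in_bytes - size_reduction_words * WORD_SIZE_BYTES

print(joined_size_in_bytes(64, 1, 1, 1))  # 56: merging the two bitsets saves one 8-byte word
```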
[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068684#comment-15068684 ] Andrew Davidson commented on SPARK-12484: - you can find some more back ground on the email thread 'should I file a bug? Re: trouble implementing Transformer and calling DataFrame.withColumn()' > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at 
org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12484) DataFrame withColumn() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12484: Attachment: UDFTest.java Add a unit test file > DataFrame withColumn() does not work in Java > > > Key: SPARK-12484 > URL: https://issues.apache.org/jira/browse/SPARK-12484 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: mac El Cap. 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: UDFTest.java > > > DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises > org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing > from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS > transformedByUDF#3]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) > at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) > at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12484) DataFrame withColumn() does not work in Java
Andrew Davidson created SPARK-12484: --- Summary: DataFrame withColumn() does not work in Java Key: SPARK-12484 URL: https://issues.apache.org/jira/browse/SPARK-12484 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2 Environment: mac El Cap. 10.11.2 Java 8 Reporter: Andrew Davidson DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS transformedByUDF#3]; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691) at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150) at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 8:24 PM: -- [~davies], [~jason.white] Are you sure that this patch is OK ? In 1.6.0 if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies], [~jason.white] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
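The lazy alternative the issue sketches — validate each row as it is consumed rather than eagerly materializing the first 10 — can be illustrated with a plain generator. This is a hypothetical helper showing the idea, not PySpark's actual code path:

```python
def validating(rows, schema_ok):
    """Yield rows unchanged, failing fast on the first row that breaks the schema."""
    for row in rows:
        if not schema_ok(row):
            raise ValueError("row does not match schema: %r" % (row,))
        yield row

# Nothing runs until the pipeline is actually consumed, so no eager take(10):
checked = validating(iter([1, 2, 3]), lambda r: isinstance(r, int))
print(list(checked))  # [1, 2, 3]
```

Because the check rides along with consumption, every row is covered without forcing the DAG to execute up front.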
[jira] [Commented] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068670#comment-15068670 ] Apache Spark commented on SPARK-12462: -- User 'xguo27' has created a pull request for this issue: https://github.com/apache/spark/pull/10437 > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: SPARK_12483_unitTest.java added a unit test file > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Comment: was deleted (was: add a unit test file ) > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: (was: SPARK_12483_unitTest.java) > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java
[ https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Davidson updated SPARK-12483: Attachment: SPARK_12483_unitTest.java add a unit test file > Data Frame as() does not work in Java > - > > Key: SPARK-12483 > URL: https://issues.apache.org/jira/browse/SPARK-12483 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: Mac El Cap 10.11.2 > Java 8 >Reporter: Andrew Davidson > Attachments: SPARK_12483_unitTest.java > > > Following unit test demonstrates a bug in as(). The column name for aliasDF > was not changed >@Test > public void bugDataFrameAsTest() { > DataFrame df = createData(); > df.printSchema(); > df.show(); > > DataFrame aliasDF = df.select("id").as("UUID"); > aliasDF.printSchema(); > aliasDF.show(); > } > DataFrame createData() { > Features f1 = new Features(1, category1); > Features f2 = new Features(2, category2); > ArrayList data = new ArrayList(2); > data.add(f1); > data.add(f2); > //JavaRDD rdd = > javaSparkContext.parallelize(Arrays.asList(f1, f2)); > JavaRDD rdd = javaSparkContext.parallelize(data); > DataFrame df = sqlContext.createDataFrame(rdd, Features.class); > return df; > } > This is the output I got (without the spark log msgs) > root > |-- id: integer (nullable = false) > |-- labelStr: string (nullable = true) > +---++ > | id|labelStr| > +---++ > | 1| noise| > | 2|questionable| > +---++ > root > |-- id: integer (nullable = false) > +---+ > | id| > +---+ > | 1| > | 2| > +---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12483) Data Frame as() does not work in Java
Andrew Davidson created SPARK-12483:
---
Summary: Data Frame as() does not work in Java
Key: SPARK-12483
URL: https://issues.apache.org/jira/browse/SPARK-12483
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.2
Environment: Mac El Cap 10.11.2, Java 8
Reporter: Andrew Davidson

The following unit test demonstrates a bug in as(): the column name for aliasDF is not changed.

    @Test
    public void bugDataFrameAsTest() {
        DataFrame df = createData();
        df.printSchema();
        df.show();

        DataFrame aliasDF = df.select("id").as("UUID");
        aliasDF.printSchema();
        aliasDF.show();
    }

    DataFrame createData() {
        Features f1 = new Features(1, category1);
        Features f2 = new Features(2, category2);
        ArrayList<Features> data = new ArrayList<Features>(2);
        data.add(f1);
        data.add(f2);
        //JavaRDD<Features> rdd = javaSparkContext.parallelize(Arrays.asList(f1, f2));
        JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
        DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
        return df;
    }

This is the output I got (without the spark log msgs):

    root
     |-- id: integer (nullable = false)
     |-- labelStr: string (nullable = true)

    +---+------------+
    | id|    labelStr|
    +---+------------+
    |  1|       noise|
    |  2|questionable|
    +---+------------+

    root
     |-- id: integer (nullable = false)

    +---+
    | id|
    +---+
    |  1|
    |  2|
    +---+
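For context on the report above: in the DataFrame API, {{as()}} sets an alias for the whole frame (used to qualify columns in joins), while changing a column name takes {{withColumnRenamed}} or a column-level alias inside {{select}}. A pure-Python sketch of that distinction (the {{MiniFrame}} class and its method names are a hypothetical stand-in, not Spark):

```python
# Toy stand-in illustrating why a frame-level as() leaves column names
# unchanged while a column-level rename does not. MiniFrame is hypothetical,
# not part of Spark.
class MiniFrame:
    def __init__(self, columns, rows, alias=None):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]
        self.alias = alias

    def select(self, *names):
        idx = [self.columns.index(n) for n in names]
        return MiniFrame(names, [tuple(r[i] for i in idx) for r in self.rows])

    def as_(self, name):
        # Frame-level alias: tags the whole frame; column names untouched.
        return MiniFrame(self.columns, self.rows, alias=name)

    def with_column_renamed(self, old, new):
        # Column-level rename: this is what actually changes the schema.
        cols = [new if c == old else c for c in self.columns]
        return MiniFrame(cols, self.rows, alias=self.alias)

df = MiniFrame(["id", "labelStr"], [(1, "noise"), (2, "questionable")])
alias_df = df.select("id").as_("UUID")
renamed_df = df.select("id").with_column_renamed("id", "UUID")
print(alias_df.columns)    # still ['id'] -- matches the behavior in the report
print(renamed_df.columns)  # ['UUID']
```

The same split exists in Spark's Java API: {{df.select("id").as("UUID")}} aliases the frame, whereas {{df.select("id").withColumnRenamed("id", "UUID")}} renames the column.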
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 7:45 PM: -- [~davies], [~jason.white] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068627#comment-15068627 ] Maciej Bryński edited comment on SPARK-11437 at 12/22/15 7:44 PM: -- [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} was (Author: maver1ck): [~davies] Are you sure that this patch is OK ? Right now if I'm creating DataFrame from RDD of Rows there is no schema validation. So we can create schema with wrong types. {code} from pyspark.sql.types import * schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(id=1, name=None)] {code} Even better. Column can change places. {code} from pyspark.sql.types import * schema = StructType([StructField("name", IntegerType())]) sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect() [Row(name=1)] {code} > createDataFrame shouldn't .take() when provided schema > -- > > Key: SPARK-11437 > URL: https://issues.apache.org/jira/browse/SPARK-11437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jason White >Assignee: Jason White > Fix For: 1.6.0 > > > When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls > `.take(10)` to verify the first 10 rows of the RDD match the provided schema. 
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue > affected cases where a schema was not provided. > Verifying the first 10 rows is of limited utility and causes the DAG to be > executed non-lazily. If necessary, I believe this verification should be done > lazily on all rows. However, since the caller is providing a schema to > follow, I think it's acceptable to simply fail if the schema is incorrect. > https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
[ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068627#comment-15068627 ] Maciej Bryński commented on SPARK-11437:

[~davies] Are you sure that this patch is OK? Right now, if I'm creating a DataFrame from an RDD of Rows, there is no schema validation, so we can create a schema with the wrong types.

{code}
from pyspark.sql.types import *
schema = StructType([StructField("id", IntegerType()), StructField("name", IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect()
[Row(id=1, name=None)]
{code}

Even better: columns can change places.

{code}
from pyspark.sql.types import *
schema = StructType([StructField("name", IntegerType())])
sqlCtx.createDataFrame(sc.parallelize([Row(id=1, name="abc")]), schema).collect()
[Row(name=1)]
{code}

> createDataFrame shouldn't .take() when provided schema
> --
>
> Key: SPARK-11437
> URL: https://issues.apache.org/jira/browse/SPARK-11437
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Jason White
> Assignee: Jason White
> Fix For: 1.6.0
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls
> `.take(10)` to verify the first 10 rows of the RDD match the provided schema.
> Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue
> affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be
> executed non-lazily. If necessary, I believe this verification should be done
> lazily on all rows. However, since the caller is providing a schema to
> follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325
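The lazy, all-rows verification suggested in the issue description can be sketched in plain Python (no Spark here; {{verify_rows}} and the two-field schema are hypothetical names, not Spark internals): a generator validates each row only as it is consumed, so nothing executes up front, and a mismatch like a string against an integer field fails loudly instead of silently becoming {{None}}.

```python
# Sketch of lazy schema verification: rows are checked only as they are
# consumed, so building the pipeline stays lazy. All names here are
# hypothetical, not Spark's internal API.
def verify_rows(rows, schema):
    """schema: list of (field_name, python_type) pairs."""
    for i, row in enumerate(rows):
        for name, expected in schema:
            value = row[name]
            if value is not None and not isinstance(value, expected):
                raise TypeError(
                    "row %d: field %r expected %s, got %r"
                    % (i, name, expected.__name__, value))
        yield row

schema = [("id", int), ("name", int)]   # deliberately wrong: name holds a str
rows = [{"id": 1, "name": "abc"}]

checked = verify_rows(rows, schema)     # nothing validated yet (lazy)
try:
    list(checked)                       # consumption triggers the check
except TypeError as e:
    print("validation failed:", e)
```

This keeps the DAG lazy while still rejecting rows that contradict the caller-provided schema, rather than coercing them to {{None}} as shown in the comment above.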
[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12462: Assignee: Apache Spark > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068572#comment-15068572 ] Kyle Sutton edited comment on SPARK-6476 at 12/22/15 7:14 PM: -- The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_ The problem is that, while the file server is listening on all ports on the file server host, the _Spark_ service attempts to call back to the default IP of the host, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR: * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}} * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} * A connection is made from the driver host to the _Spark_ service ** {{spark.driver.host}} is set to the IP of the driver host on the LAN {{172.30.0.2}} ** {{spark.driver.port}} is set to {{50003}} ** {{spark.fileserver.port}} is set to {{50005}} * Locally (on the driver host), the following listeners are active: ** {{0.0.0.0:50005}} ** {{172.30.0.2:50003}} * The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} * The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server ** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect {code:title=Driver|borderStyle=solid} SparkConf conf = new SparkConf() .setMaster("spark://172.30.0.3:7077") .setAppName("TestApp") .set("spark.driver.host", "172.30.0.2") .set("spark.driver.port", "50003") .set("spark.fileserver.port", 
"50005"); JavaSparkContext sc = new JavaSparkContext(conf); sc.addJar("target/code.jar"); {code} {code:title=Stacktrace|borderStyle=solid} 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.
[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12462: Assignee: (was: Apache Spark) > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12462) Add ExpressionDescription to misc non-aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-12462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12462: Assignee: Apache Spark > Add ExpressionDescription to misc non-aggregate functions > - > > Key: SPARK-12462 > URL: https://issues.apache.org/jira/browse/SPARK-12462 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6476) Spark fileserver not started on same IP as using spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-6476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068572#comment-15068572 ] Kyle Sutton commented on SPARK-6476:

The issue of the file server using the default IP instead of the IP address configured through {{spark.driver.host}} still exists in _Spark 1.5.2_.

The problem is that, while the file server is listening on all interfaces on the file server host, the _Spark_ service attempts to call back to the default IP of the host, to which it may or may not have connectivity. For instance, the following setup causes a {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact the _Spark_ driver host for a JAR:

* Driver host has a default IP of {{192.168.1.2}} and a secondary LAN connection IP of {{172.30.0.2}}
* _Spark_ service is on the LAN with an IP of {{172.30.0.3}}
* A connection is made from the driver host to the _Spark_ service
** {{spark.driver.host}} is set to the IP of the driver host on the LAN {{172.30.0.2}}
** {{spark.driver.port}} is set to {{50003}}
** {{spark.fileserver.port}} is set to {{50005}}
* Locally (on the driver host), the following listeners are active:
** {{0.0.0.0:50005}}
** {{172.30.0.2:50003}}
* The _Spark_ service calls back to the file server host for a JAR file using the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}}
* The _Spark_ service, being on a different network than the driver host, cannot see the {{192.168.1.0/24}} address space, and fails to connect to the file server
** A {{netstat}} on the _Spark_ service host will show the connection to the file server host as being in {{SYN_SENT}} state until the process gives up trying to connect

{code:title=Driver|borderStyle=solid}
SparkConf conf = new SparkConf()
    .setMaster("spark://172.30.0.3:7077")
    .setAppName("TestApp")
    .set("spark.driver.host", "172.30.0.2")
    .set("spark.driver.port", "50003")
    .set("spark.fileserver.port", "50005");
JavaSparkContext sc =
new JavaSparkContext(conf); sc.addJar("target/code.jar"); {code} {code:title=Stacktrace|borderStyle=solid} 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.0.3): java.net.SocketTimeoutException: connect timed out at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.
[jira] [Commented] (SPARK-12482) Spark fileserver not started on same IP as configured in spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-12482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068568#comment-15068568 ] Kyle Sutton commented on SPARK-12482:

Thanks! I did. I think he's saying that the fileserver is listening on all interfaces, but if the Spark service can't see the IP given to it by the Spark driver, the listening interfaces are immaterial.

> Spark fileserver not started on same IP as configured in spark.driver.host
> --
>
> Key: SPARK-12482
> URL: https://issues.apache.org/jira/browse/SPARK-12482
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.1, 1.5.2
> Reporter: Kyle Sutton
>
> The issue of the file server using the default IP instead of the IP address
> configured through {{spark.driver.host}} still exists in _Spark 1.5.2_.
> The problem is that, while the file server is listening on all interfaces on the
> file server host, the _Spark_ service attempts to call back to the default
> IP of the host, to which it may or may not have connectivity.
> For instance, the following setup causes a > {{java.net.SocketTimeoutException}} when the _Spark_ service tries to contact > the _Spark_ driver host for a JAR: > * Driver host has a default IP of {{192.168.1.2}} and a secondary LAN > connection IP of {{172.30.0.2}} > * _Spark_ service is on the LAN with an IP of {{172.30.0.3}} > * A connection is made from the driver host to the _Spark_ service > ** {{spark.driver.host}} is set to the IP of the driver host on the LAN > {{172.30.0.2}} > ** {{spark.driver.port}} is set to {{50003}} > ** {{spark.fileserver.port}} is set to {{50005}} > * Locally (on the driver host), the following listeners are active: > ** {{0.0.0.0:50005}} > ** {{172.30.0.2:50003}} > * The _Spark_ service calls back to the file server host for a JAR file using > the driver host's default IP: {{http://192.168.1.2:50005/jars/code.jar}} > * The _Spark_ service, being on a different network than the driver host, > cannot see the {{192.168.1.0/24}} address space, and fails to connect to the > file server > ** A {{netstat}} on the _Spark_ service host will show the connection to the > file server host as being in {{SYN_SENT}} state until the process gives up > trying to connect > {code:title=Driver|borderStyle=solid} > SparkConf conf = new SparkConf() > .setMaster("spark://172.30.0.3:7077") > .setAppName("TestApp") > .set("spark.driver.host", "172.30.0.2") > .set("spark.driver.port", "50003") > .set("spark.fileserver.port", "50005"); > JavaSparkContext sc = new JavaSparkContext(conf); > sc.addJar("target/code.jar"); > {code} > {code:title=Stacktrace|borderStyle=solid} > 15/12/22 12:48:33 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, > 172.30.0.3): java.net.SocketTimeoutException: connect timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > s
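The distinction driving this bug is bind address vs. advertised address: a server can bind {{0.0.0.0}} (all interfaces) and still hand peers an unroutable default IP. A pure-Python sketch of the fix direction, reusing {{spark.driver.host}} for the advertised callback URL when it is set ({{callback_url}} is a hypothetical helper, not Spark code; the default fileserver port of 50005 is taken from the report above):

```python
# Bind vs. advertise: binding to 0.0.0.0 accepts connections on every
# interface, but peers must be told a routable address. This sketch derives
# the advertised file-server URL from spark.driver.host when it is set,
# falling back to the host's default IP only as a last resort.
# callback_url is a hypothetical helper, not Spark's actual API.

def callback_url(conf, default_ip, path="/jars/code.jar"):
    host = conf.get("spark.driver.host") or default_ip
    port = conf.get("spark.fileserver.port", "50005")
    return "http://%s:%s%s" % (host, port, path)

conf = {
    "spark.driver.host": "172.30.0.2",
    "spark.driver.port": "50003",
    "spark.fileserver.port": "50005",
}
# Advertises the routable LAN IP instead of the default 192.168.1.2:
print(callback_url(conf, default_ip="192.168.1.2"))
```

With the setup from the report, this would advertise {{http://172.30.0.2:50005/jars/code.jar}}, which the Spark service on the {{172.30.0.0/24}} network can reach, instead of the unroutable {{192.168.1.2}} default.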